Crowded video sequences, such as footage of demonstrations, pose a challenging problem for object detection and tracking because of their complexity: they are shot outdoors, often under varying illumination, and show faces that are not in frontal view, perspective effects, complex backgrounds, and so on. The high number of occlusions makes tracking individuals difficult. This paper proposes a mutual feedback spatiotemporal detection strategy to tackle these problems. Spatial detection is based on skin colour classification and shape analysis with morphological tools; temporal tracking is based on analysis of the optical flow. The two stages cooperate: the mutual feedback scheme benefits both spatial detection and temporal tracking, which improves the efficiency of the overall system. To deal with multiple occlusions, a graph-based tracking technique that takes advantage of neighbourhood consistency is introduced.
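
As a rough illustration of the spatial detection stage only, the sketch below combines a per-pixel skin colour test with a morphological opening that removes isolated false positives. It is a minimal sketch under stated assumptions, not the classifier or the morphological operators used in the paper: the RGB thresholds are a commonly quoted rule of thumb, the 3x3 square structuring element is an arbitrary choice, and all function names (`is_skin`, `skin_mask`, `opening`) are hypothetical. Plain Python lists are used to keep the example dependency-free.

```python
def is_skin(r, g, b):
    """Rule-of-thumb RGB skin test (an assumption, not the paper's classifier)."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)


def skin_mask(image):
    """Map an image given as rows of (r, g, b) tuples to rows of booleans."""
    return [[is_skin(*pixel) for pixel in row] for row in image]


def _morph(mask, combine):
    """Apply `combine` (all or any) over each pixel's 3x3 neighbourhood.

    Positions outside the image are treated as background (False).
    """
    height, width = len(mask), len(mask[0])

    def at(y, x):
        return 0 <= y < height and 0 <= x < width and mask[y][x]

    return [[combine(at(y + dy, x + dx)
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1))
             for x in range(width)]
            for y in range(height)]


def erode(mask):
    """Keep a pixel only if its whole 3x3 neighbourhood is set."""
    return _morph(mask, all)


def dilate(mask):
    """Set a pixel if any pixel in its 3x3 neighbourhood is set."""
    return _morph(mask, any)


def opening(mask):
    """Erosion followed by dilation: drops blobs smaller than 3x3."""
    return dilate(erode(mask))
```

Applied to a mask that contains a face-sized skin region and a few stray skin-coloured pixels, `opening(skin_mask(image))` keeps the region and discards the strays. In a feedback scheme such as the one described above, the surviving regions would be the candidates handed to the temporal tracker.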