Finding People in Archive Films through Tracking



Introduction

Archive films provide a particularly challenging setting for detecting and tracking people: the videos are low in quality and lack color, cameras move, and there are many occlusions and crowded scenes. Face detection is a good starting point for these films because the people in them show their faces most of the time.

In this work we show an interesting example of how face detection and face tracking help each other: (1) detection makes tracking easy by (mostly) operating at the object level, hence independent of changes in low-level signals; (2) tracking makes detection easy because most false positives are isolated and can be removed by inference on temporal coherence. Even simple algorithms lead to robust detection and tracking results.

Top: single-frame face detection is far from perfect, with a lot of false detections. Bottom: we track faces and integrate face scores temporally to remove false positives and recover misses.

Face Tracking by Detection

Face tracking in archive films can be very hard: there are large variations in face pose and illumination, and (low-level) appearance-based tracking often fails even over short distances. By tracking at the object (face) level, however, the algorithm avoids dealing with these variations: as long as it is a face that moves smoothly, we can track its location with no trouble.

Of course, there are many difficult situations where face detection does not work, such as when the person is facing away or seen at an odd angle, or when the illumination is bad. In such cases, we can switch to low-level tracking (using correlation) and continue. Often the "gaps" are short: the face soon re-appears, and the algorithm can return to the object tracking mode.
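The fallback to correlation-based tracking can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists, a face template cropped from the last successful detection, and a small search window around the previous position; the function names and the `radius` parameter are hypothetical and not taken from our implementation.

```python
import math

def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equally sized 2D patches."""
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    # A flat patch has zero variance; treat it as uncorrelated.
    return num / den if den > 0 else 0.0

def crop(frame, top, left, height, width):
    """Cut a height x width patch out of a frame."""
    return [row[left:left + width] for row in frame[top:top + height]]

def track_by_correlation(frame, template, prev_top, prev_left, radius=2):
    """Search around the previous position; return (top, left, score)."""
    h, w = len(template), len(template[0])
    best = (prev_top, prev_left, -1.0)
    top_max = min(len(frame) - h, prev_top + radius)
    left_max = min(len(frame[0]) - w, prev_left + radius)
    for top in range(max(0, prev_top - radius), top_max + 1):
        for left in range(max(0, prev_left - radius), left_max + 1):
            score = ncc(template, crop(frame, top, left, h, w))
            if score > best[2]:
                best = (top, left, score)
    return best
```

Because the correlation is normalized, the match is insensitive to uniform changes in brightness and contrast, but not to changes in pose, which is one reason this mode is only suitable for bridging short gaps.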

Our actual tracking algorithm is simple: use the Viola-Jones detector with a low "threshold" to find all possible faces in each frame, and track these faces with a Kalman filter. We use a conservative strategy for initialization, i.e. each potential face starts a new face track if it does not match an existing track. On the other hand, we use an aggressive greedy strategy for data association: we estimate how "good" the face tracks are, and allow the best tracks to match candidate faces first.
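The greedy association step can be sketched as follows. This is an illustration under stated assumptions, not our implementation: each track is assumed to carry a predicted face center for the current frame (for example from a Kalman filter, which is not shown), track "goodness" is approximated by the sum of matched detector scores, and the `Track` class, `associate` function, and `max_dist` gate are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    predicted: tuple  # (x, y) predicted face center for the current frame
    scores: list = field(default_factory=list)  # scores of matched faces

    def quality(self):
        # Assumed "goodness" measure: accumulated detector score.
        return sum(self.scores)

def associate(tracks, detections, max_dist=20.0):
    """Greedily match detections (x, y, score) to tracks, best tracks first.

    Returns (matches, new_tracks): matches maps track_id -> detection index,
    and every unmatched detection starts a new track.
    """
    unused = set(range(len(detections)))
    matches = {}
    for track in sorted(tracks, key=lambda t: t.quality(), reverse=True):
        best, best_dist = None, max_dist
        for i in sorted(unused):
            x, y, _ = detections[i]
            dist = ((x - track.predicted[0]) ** 2 +
                    (y - track.predicted[1]) ** 2) ** 0.5
            if dist < best_dist:
                best, best_dist = i, dist
        if best is not None:
            matches[track.track_id] = best
            track.scores.append(detections[best][2])
            unused.discard(best)
    next_id = max((t.track_id for t in tracks), default=-1) + 1
    new_tracks = []
    for i in sorted(unused):
        x, y, score = detections[i]
        new_tracks.append(Track(next_id, (x, y), [score]))
        next_id += 1
    return matches, new_tracks
```

Processing the strongest tracks first means that when two tracks compete for the same candidate face, the better-supported one wins, and the weaker track is left unmatched for that frame.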

Face Detection by Tracking

Tracking establishes the correspondence between candidate faces across frames. There are many false positives from the detection stage, but most false positives are isolated and form short tracks. Good faces form long tracks and they reinforce each other, increasing their likelihood of being a true face.

Enforcing temporal consistency is not a trivial issue. Most previous work uses simple strategies, e.g. thresholding on the track length. There are several factors at play: (1) the face confidence score of each face detection; (2) the track length; (3) the confidence in tracking/temporal correspondence. In particular, as we are tracking all candidate faces, there are numerous tracks and we cannot assume the tracking to be 100% correct. Tracking can fail, especially in the low-level tracking mode over long "gaps" between detections.
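To see why a single simple criterion is not enough, consider a small sketch of three per-track scoring rules: track length, average detection score, and maximum detection score. The track names, scores, and thresholds below are made up for illustration, and `keep_tracks` is a hypothetical helper.

```python
def track_length(scores):
    """Number of detections in the track."""
    return len(scores)

def average_score(scores):
    """Mean detector score over the track."""
    return sum(scores) / len(scores)

def maximum_score(scores):
    """Best single detector score in the track."""
    return max(scores)

def keep_tracks(tracks, score_fn, threshold):
    """Return the ids of tracks whose score reaches the threshold."""
    return [tid for tid, scores in tracks.items()
            if score_fn(scores) >= threshold]

# Illustrative tracks: per-frame detector scores (made-up values).
tracks = {
    "long_weak": [0.3] * 8,
    "short_strong": [0.95],
    "long_strong": [0.7, 0.8, 0.9, 0.8, 0.7, 0.9],
}

by_length = keep_tracks(tracks, track_length, 5)
by_average = keep_tracks(tracks, average_score, 0.5)
by_maximum = keep_tracks(tracks, maximum_score, 0.8)
```

In this toy example the length rule keeps the long but weak track, while the score rules keep the single isolated detection. None of the rules uses the confidence of the temporal correspondence itself.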

A systematic way to combine these factors is to use a learned probabilistic model; in this case we use a one-dimensional conditional random field. Inference is standard and stochastic gradient descent gives parameters consistent with intuition. We compare the CRF integration model with several baselines, such as using track length, using average score, using maximum score, or using a local (short-range) SVM classifier.
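As a rough illustration of inference in a one-dimensional (chain) CRF, the sketch below computes the per-detection marginal probability of "face" with the forward-backward algorithm. The form of the potentials is an assumption made for this example: the unary term comes directly from a detection score in (0, 1), and the pairwise term rewards label agreement in proportion to a link confidence. The `coupling` weight is a hand-picked placeholder, not a learned parameter, and the function names are hypothetical.

```python
import math

def build_potentials(scores, link_conf, coupling=2.0, eps=1e-6):
    """Assumed log-potentials: labels are 0 (not a face) and 1 (face)."""
    unary = []
    for s in scores:
        s = min(max(s, eps), 1.0 - eps)
        unary.append([math.log(1.0 - s), math.log(s)])
    # One 2x2 table per edge; agreement is rewarded by coupling * confidence.
    pairwise = [[[coupling * c, 0.0], [0.0, coupling * c]] for c in link_conf]
    return unary, pairwise

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def chain_marginals(unary, pairwise):
    """Forward-backward on a binary chain; returns P(label = 1) per node."""
    n = len(unary)
    fwd = [None] * n
    fwd[0] = normalize([math.exp(u) for u in unary[0]])
    for t in range(1, n):
        fwd[t] = normalize([
            math.exp(unary[t][k]) *
            sum(fwd[t - 1][j] * math.exp(pairwise[t - 1][j][k]) for j in (0, 1))
            for k in (0, 1)
        ])
    bwd = [None] * n
    bwd[n - 1] = [0.5, 0.5]
    for t in range(n - 2, -1, -1):
        bwd[t] = normalize([
            sum(math.exp(pairwise[t][j][k]) * math.exp(unary[t + 1][k]) *
                bwd[t + 1][k] for k in (0, 1))
            for j in (0, 1)
        ])
    marginals = []
    for t in range(n):
        m = [fwd[t][k] * bwd[t][k] for k in (0, 1)]
        marginals.append(m[1] / (m[0] + m[1]))
    return marginals
```

With these assumed potentials, a weak detection that is confidently linked to strong neighbors receives a higher marginal than its own score, while a detection with no links keeps its original score. This mirrors the intended effect of recovering misses and suppressing isolated false positives.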

Results on Full-length Films

Ground-truth faces are labeled in three full-length films (every 100th frame) and are used both to train the models and to evaluate (final) detection performance. The CRF model greatly improves face detection (90% average precision, 70% recall) over single-frame detection (60% precision, 58% recall). It also outperforms various baseline models for temporal integration.

(a) Casablanca: temporal vs. static. (b) Casablanca: CRF vs. baselines. (c) Kind Hearts and Coronets. (d) The Great Dictator.

Here are some sample video results. The tracks are selected automatically and show high recall with few false positives.

Casablanca, Kind Hearts and Coronets, The Great Dictator

References

  1. Finding People in Archive Films through Tracking.
      Xiaofeng Ren, in CVPR '08, Anchorage 2008.

  2. Face detection and tracking in a video by propagating detection probabilities.
      R. Choudhury, C. Schmid, and K. Mikolajczyk, PAMI 25(10), 2003.