Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study

This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than those used by other authors. The results give new insights into the practical utility of existing synchrony definitions and justify the application of audio-visual synchrony techniques to the problem of active speaker localisation in broadcast video. Performance is evaluated on a test set of twelve clips of alternating speakers from the multiple-speaker CUAVE corpus. An accuracy of 76% is obtained for the task of identifying the active member of a speaker pair at different points in time, comparable to the performance of two purely video image-based schemes. An accuracy of 65% is obtained on the more challenging task of locating a point within a 100×100 pixel square centred on the active speaker's mouth with no prior face detection; the performance upper bound, were perfect face detection available, is 69%. This result is significantly better than those of the two purely video image-based schemes.
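
To make the underlying measure concrete, the following is a minimal sketch of one widely used synchrony definition: the mutual information between a scalar audio feature and each pixel's intensity time series under a jointly Gaussian assumption, in the spirit of Hershey and Movellan's Audio Vision approach. The function names, the choice of per-frame audio energy as the feature, and the argmax localisation rule are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def gaussian_mi_synchrony(audio, video):
    """Per-pixel audio-visual synchrony map.

    Estimates the mutual information between a 1-D audio feature
    (e.g. per-frame energy) and each pixel's intensity time series
    under a jointly Gaussian assumption:
        I = -0.5 * log(1 - rho^2),
    where rho is the Pearson correlation over time.

    audio: (T,) array of per-frame audio features
    video: (T, H, W) array of grayscale frames
    returns: (H, W) synchrony map
    """
    T, H, W = video.shape
    # Standardise the audio feature over time.
    a = (audio - audio.mean()) / (audio.std() + 1e-8)
    # Standardise every pixel's time series.
    v = video.reshape(T, -1).astype(np.float64)
    v = (v - v.mean(axis=0)) / (v.std(axis=0) + 1e-8)
    rho = a @ v / T                      # correlation per pixel
    rho = np.clip(rho, -0.999, 0.999)    # keep the log argument positive
    mi = -0.5 * np.log(1.0 - rho ** 2)
    return mi.reshape(H, W)

def locate_speaker(audio, video):
    """Return the (row, col) of the most synchronous pixel."""
    sync = gaussian_mi_synchrony(audio, video)
    return np.unravel_index(np.argmax(sync), sync.shape)
```

Under the paper's evaluation protocol, the point returned by such a scheme would count as a correct localisation if it falls inside a 100×100 pixel square centred on the active speaker's mouth.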
