Finding presentations in recorded meetings using audio and video features

This paper describes a method for finding segments in video-recorded meetings that correspond to presentations. These segments serve as indexes into the recorded meeting. The system automatically detects intervals of video that correspond to presentation slides. We assume that only one person speaks during an interval when slides are detected. Thus these intervals can be used as training data for a speaker spotting system. An HMM is automatically constructed and trained on the audio data from each slide interval. A Viterbi alignment then resegments the audio according to speaker. Since the same speaker may talk across multiple slide intervals the acoustic data from these intervals is clustered to yield an estimate of the number of distinct speakers and their order. This allows the individual presentations in the video to be identified from the location of each presenter's speech. Results are presented for a corpus of six meeting videos.

[1]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[2]  Francine Chen,et al.  Segmentation of speech using speaker identification , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[3]  Don Kimber,et al.  Acoustic Segmentation for Audio Browsers , 1997 .

[4]  Jonathan Foote,et al.  Content-based retrieval of music and audio , 1997, Other Conferences.

[5]  Herbert Gish,et al.  Segregation of speakers for speech recognition and speaker identification , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[6]  Andreas Girgensohn,et al.  Video classification using transform coefficients , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[7]  Robert Malkin,et al.  Experiments in automatic meeting transcription using JRTK , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[8]  Yasuo Ariki,et al.  Article Extraction and Classification of TV News Using Image and Speech Processing , 1996, CODAS.

[9]  Andreas Girgensohn,et al.  An intelligent media browser using automatic multimodal analysis , 1998, MULTIMEDIA '98.