Caption-aided speech detection in videos

This paper presents a novel audio-visual fusion method for speech detection, an important front end for content-based video processing. The approach aims to extract homogeneous speech segments from the audio stream of real-world movie and TV videos with the help of video captions. Captions, however, are created mainly to help viewers follow the dialog rather than to mark speech regions accurately, so their timing is only approximate. The proposed caption-aided speech detection approach therefore combines caption information with audio information: the inaccurate caption positions are refined using audio features (pitch and MFCCs) and BIC-based acoustic change detection. Comparison experiments against several traditional speech detection approaches show that the proposed approach substantially improves speech detection performance.
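To make the BIC-based acoustic change detection step more concrete, the following is a minimal sketch of the standard delta-BIC test on a window of feature frames. It is not taken from the paper: the function names (`delta_bic`, `best_change_point`), the full-covariance Gaussian model, the `penalty_weight` and `margin` parameters, and the small regularization term are all illustrative assumptions. In practice the rows of `features` would be MFCC vectors extracted around a caption boundary.

```python
import numpy as np


def _logdet_cov(x):
    """Log-determinant of the ML covariance of the frames in x (rows = frames)."""
    d = x.shape[1]
    cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
    cov = cov + 1e-6 * np.eye(d)  # assumed regularization to keep the matrix well conditioned
    _, logdet = np.linalg.slogdet(cov)
    return logdet


def delta_bic(features, t, penalty_weight=1.0):
    """Delta-BIC for splitting `features` at frame index t.

    Compares one Gaussian over the whole window against two Gaussians,
    one per side of t. A positive value favors an acoustic change at t.
    """
    n, d = features.shape
    left, right = features[:t], features[t:]
    gain = 0.5 * (
        n * _logdet_cov(features)
        - len(left) * _logdet_cov(left)
        - len(right) * _logdet_cov(right)
    )
    # Model-complexity penalty: d mean terms plus d(d+1)/2 covariance terms.
    penalty = 0.5 * penalty_weight * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty


def best_change_point(features, margin=10, penalty_weight=1.0):
    """Return the frame index with the largest positive delta-BIC, or None."""
    n = features.shape[0]
    candidates = list(range(margin, n - margin))
    if not candidates:
        return None
    scores = [delta_bic(features, t, penalty_weight) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best] if scores[best] > 0 else None
```

Under these assumptions, a caption boundary could be refined by calling `best_change_point` on the frames in a window around the caption's start or end time and moving the boundary to the returned index when one is found. The `margin` parameter keeps both sides of the split long enough for a stable covariance estimate.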
