Speaker identification and video analysis for hierarchical video shot classification

We present a new video shot classification and clustering technique to support content-based indexing, browsing and retrieval in video databases. The proposed method is based on the analysis of both the audio and visual data tracks. The visual stream is analyzed using a 3-D wavelet transform and segmented into shot units which are matched and clustered by visual content. Simultaneously, speaker changes are detected by tracking voiced phonemes in the audio signal. The clues obtained from the video and speech data are combined to classify and group the isolated video shots. This integrated approach also allows effective indexing of the audio-visual objects in multimedia databases.
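The shot segmentation step can be illustrated with a minimal sketch. It is not the authors' implementation: it applies only a one-level, undecimated Haar high-pass filter along the temporal axis (a single component of a 3-D wavelet transform) and marks a shot boundary wherever the mean detail magnitude exceeds a threshold. The function names, the flat-list frame representation, and the threshold value of 20.0 are all assumptions made for illustration.

```python
from math import sqrt


def temporal_haar_detail(frames):
    """One-level Haar high-pass along time for each adjacent frame pair.

    frames: list of equal-length lists of pixel intensities (assumed layout).
    Returns one mean absolute detail value per transition t -> t + 1.
    """
    energies = []
    for prev, curr in zip(frames, frames[1:]):
        # Haar detail coefficient per pixel: (b - a) / sqrt(2)
        detail = [(b - a) / sqrt(2) for a, b in zip(prev, curr)]
        energies.append(sum(abs(d) for d in detail) / len(detail))
    return energies


def detect_shot_boundaries(frames, threshold=20.0):
    """Return frame indices at which a new shot is assumed to start.

    The threshold is a hypothetical value, not taken from the paper.
    """
    energies = temporal_haar_detail(frames)
    return [t + 1 for t, e in enumerate(energies) if e > threshold]


def split_into_shots(frames, threshold=20.0):
    """Return (start, end) frame index pairs, with end exclusive."""
    cuts = detect_shot_boundaries(frames, threshold)
    starts = [0] + cuts
    ends = cuts + [len(frames)]
    return list(zip(starts, ends))


if __name__ == "__main__":
    # Three dark frames followed by two bright frames: one abrupt cut.
    frames = [[10, 10, 10, 10]] * 3 + [[200, 200, 200, 200]] * 2
    print(split_into_shots(frames))
```

In a fuller pipeline, the resulting shot units would then be matched and clustered by visual content and combined with the detected speaker changes, as the abstract describes; those stages are omitted here.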
