Video classification using speaker identification

Video content characterization is a challenging problem in video databases. Its aim is to generate indices that describe a video clip in terms of the objects in the clip and their actions. Such indices are generally extracted by performing image analysis on the video clips, but many can also be generated by analyzing the audio information embedded in them. Indices pertaining to context, scene emotion, and the actors or characters present in a video clip appear especially well suited to generation via the audio analysis techniques of keyword spotting, speech recognition, and speaker recognition. In this paper, we examine the potential of speaker identification techniques for characterizing video clips in terms of the actors present in them. We describe a three-stage processing system, consisting of a shot boundary detection stage, an audio classification stage, and a speaker identification stage, that determines the presence of different actors in isolated shots. Experimental results using the movie A Few Good Men are presented to show the efficacy of speaker identification for labeling video clips in terms of the persons present in them.
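The three-stage structure described above can be illustrated with a minimal sketch. The abstract does not specify the algorithms used in each stage, so everything below is an assumption made for illustration: shot boundaries are found by thresholding the difference between consecutive frame histograms, audio is classified as speech by a mean-energy floor, and speakers are identified by the nearest enrolled reference vector. All function names, thresholds, and feature layouts are hypothetical stand-ins, not the system evaluated in the paper.

```python
def detect_shots(histograms, threshold=0.5):
    """Stage 1 (toy): start a new shot where consecutive histograms differ strongly."""
    starts = [0]
    for i in range(1, len(histograms)):
        diff = sum(abs(a - b) for a, b in zip(histograms[i - 1], histograms[i]))
        if diff > threshold:
            starts.append(i)
    ends = starts[1:] + [len(histograms)]
    return list(zip(starts, ends))  # half-open frame ranges (start, end)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def is_speech(vectors, energy_floor=0.2):
    """Stage 2 (toy): assume component 0 is energy; call the shot speech if its mean is high."""
    return mean_vector(vectors)[0] > energy_floor


def identify_speaker(vectors, speaker_models):
    """Stage 3 (toy): pick the enrolled speaker whose reference vector is nearest."""
    centre = mean_vector(vectors)

    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(centre, speaker_models[name]))

    return min(speaker_models, key=squared_distance)


def label_shots(histograms, audio, speaker_models):
    """Run the three stages in order and label each shot with a speaker or None."""
    labels = []
    for start, end in detect_shots(histograms):
        segment = audio[start:end]
        speaker = identify_speaker(segment, speaker_models) if is_speech(segment) else None
        labels.append((start, end, speaker))
    return labels


if __name__ == "__main__":
    # Made-up per-frame data: a 2-bin image histogram and an [energy, feature] audio vector.
    histograms = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5]]
    audio = [[0.8, 1.0], [0.9, 1.2], [0.7, 3.0], [0.8, 3.2], [0.05, 0.0], [0.02, 0.1]]
    models = {"actor_a": [0.8, 1.0], "actor_b": [0.8, 3.0]}
    for shot in label_shots(histograms, audio, models):
        print(shot)
```

Each stage narrows the work for the next: only segments inside a detected shot are classified, and only shots classified as speech are passed to speaker identification, so non-speech shots are left unlabeled rather than forced onto an actor.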
