Multichannel video segmentation

A video is a multimedia document structured into scenes and shots. A scene is a sequence of consecutive shots that share common visual and audio features. A shot is a sequence of consecutive frames delimited by cuts, which existing techniques can readily detect. Segmenting a video into scenes is a new and open problem. It is needed for scene retrieval, especially in authoring and interactive video applications. We propose a new approach to video segmentation into scenes that draws on several media and takes film syntax into account. We characterize a scene by the similarity between the color histogram of the current shot and that of one of the most recent preceding shots. Similarity between a frame of the current shot and a frame of a previous shot may indicate alternating shots, which belong to the same scene. Other techniques, based on projective geometry, are presented in a companion paper; they make it possible to detect camera movement. We recognize the speakers in a scene with AR-vector model techniques, such as the one proposed by some of the authors in the Orphee system, implemented at Laforia. However, speaker recognition is much more difficult on video CD-I, because of the several transition types and the various types of noise involved. We present experimental results based on this approach. Detection of alternating shots is efficient, but speaker recognition needs improvement.
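The histogram-based grouping of alternating shots described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the number of previous shots examined, or the decision threshold. The sketch assumes one key frame per shot, a joint RGB histogram, histogram intersection as the similarity measure, and hypothetical parameters `window` and `threshold`; all function names are illustrative.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b), each in 0..255


def color_histogram(frame: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Normalized joint RGB histogram of a frame given as a flat pixel list."""
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in frame:
        # Map each channel value 0..255 to a bin index 0..n-1.
        idx = ((r * n // 256) * n + (g * n // 256)) * n + (b * n // 256)
        hist[idx] += 1.0
    total = float(len(frame)) or 1.0
    return [h / total for h in hist]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1]; 1.0 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def group_shots(key_frames: Sequence[Sequence[Pixel]],
                window: int = 3,
                threshold: float = 0.7) -> List[int]:
    """Assign a scene label to each shot (assumed: one key frame per shot).

    A shot is compared with the `window` most recent shots. If the best
    similarity reaches `threshold`, the shot and every shot in between
    join the scene of the matched shot (alternating shots A, B, A end up
    in one scene). Otherwise the shot opens a new scene.
    """
    hists = [color_histogram(f) for f in key_frames]
    labels: List[int] = []
    for i, h in enumerate(hists):
        best_j, best_sim = None, threshold
        for j in range(max(0, i - window), i):
            sim = histogram_intersection(h, hists[j])
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is None:
            # No recent shot is similar enough: start a new scene.
            labels.append(labels[-1] + 1 if labels else 0)
        else:
            # Merge the intervening shots into the matched shot's scene.
            for k in range(best_j + 1, i):
                labels[k] = labels[best_j]
            labels.append(labels[best_j])
    return labels


if __name__ == "__main__":
    red = [(255, 0, 0)] * 16
    blue = [(0, 0, 255)] * 16
    green = [(0, 255, 0)] * 16
    # Shots alternate red/blue (e.g. a dialogue), then cut to a green setting.
    print(group_shots([red, blue, red, blue, green, green]))
```

Histogram intersection is used here only because it is simple and bounded; any other histogram distance could be substituted, and the threshold would have to be tuned on real footage.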
