Determining computable scenes in films and their structures using audio-visual memory models

In this paper we present novel algorithms for computing scenes and within-scene structures in films. We begin by mapping insights from film-making rules and experimental results from the psychology of audition into a computational scene model. We define a computable scene to be a chunk of audio-visual data that exhibits long-term consistency with regard to three properties: (a) chromaticity, (b) lighting, and (c) ambient sound. Central to the computational model is the notion of a causal, finite-memory viewer. We segment the audio and video data separately; in each case we determine the degree of correlation between the most recent data in the memory and the past. The respective scene boundaries are located at local minima of this correlation and then aligned using a nearest-neighbor algorithm. We introduce a periodic analysis transform to automatically determine the structure within a scene, and we use statistical tests on the transform to detect the presence of a dialogue. The algorithms were tested on a difficult data set: the first hour of each of five commercial films. The best results were 88% recall and 72% precision for scene detection, and 91% recall and 100% precision for dialogue detection.
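The segmentation pipeline described above can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes cosine similarity as the correlation measure, fixed memory and attention window sizes, and generic per-frame feature vectors (the paper's actual color, lighting, and audio features, and its similarity measures, may differ). All function names and parameters here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (illustrative choice)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def memory_coherence(features, memory=8, attention=2):
    """Causal finite-memory model: at each time t, correlate the most
    recent `attention` feature vectors with the older part of a sliding
    `memory`-frame buffer. Returns one score per t >= memory."""
    scores = []
    for t in range(memory, len(features)):
        recent = features[t - attention:t]
        past = features[t - memory:t - attention]
        scores.append(sum(cosine(r, p) for r in recent for p in past)
                      / (len(recent) * len(past)))
    return scores


def local_minima(scores, offset):
    """Candidate scene boundaries: strict local minima of the coherence
    curve, shifted back to original frame indices."""
    return [i + offset for i in range(1, len(scores) - 1)
            if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]]


def align(video_bounds, audio_bounds, window=3):
    """Nearest-neighbor alignment: keep each video boundary only if some
    audio boundary falls within `window` frames of it."""
    return [v for v in video_bounds
            if audio_bounds and min(abs(v - a) for a in audio_bounds) <= window]
```

On synthetic data with an abrupt feature change (e.g. twenty frames of one feature vector followed by twenty of an orthogonal one), the coherence score dips as new-scene frames enter the attention window against an old-scene memory, and `local_minima` recovers a boundary near the true change point.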
