Audio Feature Extraction and Analysis for Scene Segmentation and Classification

Understanding the scene content of a video sequence is essential for content-based indexing and retrieval in multimedia databases. Research in this area over the past several years has focused on speech recognition and image analysis techniques. As a complementary effort, we focus on using the associated audio information (mainly the nonspeech portion) for video scene analysis. As an example, we consider the problem of discriminating among five types of TV programs: commercials, basketball games, football games, news reports, and weather forecasts. A set of low-level audio features is proposed for characterizing the semantic content of short audio clips. The linear separability of the different classes in the proposed feature space is examined using clustering analysis. The effective features are identified by evaluating the intracluster and intercluster scatter matrices of the feature space. Using these features, a neural net classifier successfully separated the five types of TV programs. By evaluating the changes between the feature vectors of adjacent clips, we can also identify scene breaks in an audio sequence quite accurately. These results demonstrate the capability of the proposed audio features for characterizing the semantic content of an audio sequence.
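
The two analysis steps named in the abstract, scoring features through intracluster and intercluster scatter matrices and locating scene breaks from changes between adjacent clips, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the prior-weighted scatter definitions, the trace criterion tr(S_w^+ S_b) as the separability score, the Euclidean distance between adjacent clip feature vectors, and the fixed threshold.

```python
import numpy as np


def scatter_matrices(X, y):
    """Return (S_w, S_b), the intracluster and intercluster scatter matrices.

    X has shape (n_clips, n_features); y holds one class label per clip.
    Each class is weighted by its empirical prior (an assumed convention).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]
        prior = len(Xc) / len(X)
        class_mean = Xc.mean(axis=0)
        centered = Xc - class_mean
        # Within-class covariance, weighted by the class prior.
        S_w += prior * (centered.T @ centered) / len(Xc)
        # Outer product of the class-mean offset from the overall mean.
        diff = (class_mean - overall_mean)[:, None]
        S_b += prior * (diff @ diff.T)
    return S_w, S_b


def separability(X, y):
    """Assumed criterion: trace of pinv(S_w) @ S_b (larger means better separated)."""
    S_w, S_b = scatter_matrices(X, y)
    return float(np.trace(np.linalg.pinv(S_w) @ S_b))


def scene_breaks(features, threshold):
    """Return indices of clips whose feature vector jumps from the previous clip.

    Uses Euclidean distance between adjacent clips and a fixed threshold,
    both of which are illustrative assumptions.
    """
    features = np.asarray(features, dtype=float)
    dist = np.linalg.norm(np.diff(features, axis=0), axis=1)
    return [int(i) + 1 for i in np.flatnonzero(dist > threshold)]
```

Calling `separability` on one feature column at a time gives a rough ranking of individual features, while `scene_breaks` expects one feature vector per clip in temporal order.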
