Classification of general audio data for content-based retrieval

Abstract In this paper, we address the problem of classifying continuous general audio data (GAD) for content-based retrieval, and describe a scheme that classifies audio segments into seven categories: silence, single-speaker speech, music, environmental noise, multiple-speaker speech, simultaneous speech and music, and speech with noise. We studied a total of 143 classification features for their discrimination capability. Our study shows that cepstral-based features such as the Mel-frequency cepstral coefficients (MFCC) and linear prediction coefficients (LPC) provide better classification accuracy than temporal and spectral features. To minimize classification errors near the boundaries between audio segments of different types, a segmentation–pooling scheme is also proposed in this work. This scheme yields classification results that are consistent with human perception. Our classification system provides over 90% accuracy at a processing speed dozens of times faster than the playback rate.
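The abstract names two of the system's building blocks: frame-level cepstral features (MFCC) and a pooling step that smooths per-frame labels within segments. The paper's exact parameters are not given here, so the following numpy sketch uses hypothetical but common choices (16 kHz audio, 25 ms frames with 10 ms hop, 26 mel filters, 13 coefficients) and a simple majority-vote pooling as a stand-in for the segmentation–pooling scheme:

```python
import numpy as np
from collections import Counter

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filterbank, shape (n_filters, n_fft//2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                      # rising slope
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling slope
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    """Frame-level MFCCs: window -> power spectrum -> mel log-energies -> DCT."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    fb = mel_filterbank(n_filters, n_fft, sr)
    mel_energy = np.log(power @ fb.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies into cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return mel_energy @ dct.T                      # shape (n_frames, n_ceps)

def pool_labels(frame_labels, seg_len):
    """Majority-vote pooling: replace each segment's frame labels with the
    segment's most common label, smoothing errors near class boundaries."""
    pooled = []
    for start in range(0, len(frame_labels), seg_len):
        seg = frame_labels[start:start + seg_len]
        pooled.extend([Counter(seg).most_common(1)[0][0]] * len(seg))
    return pooled
```

For example, one second of 16 kHz audio yields 98 frames of 13 coefficients each, and `pool_labels(['s','s','m','s'], 4)` collapses the stray `'m'` frame into a uniform `'s'` segment. The actual paper pools over detected segment boundaries rather than fixed-length windows; this fixed-window version is only an approximation of that idea.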
