Towards robust features for classifying audio in the CueVideo system

The role of audio in the context of multimedia applications involving video is becoming increasingly important. Many efforts in this area focus on audio data that contains some built-in semantic structure, such as broadcast news, or focus on classifying audio that contains a single type of sound, such as clear speech or clear music only. In the CueVideo system, we detect and classify mixed audio, i.e., combinations of speech and music together with other types of background sound. Segmentation of mixed audio has applications in detecting story boundaries in video, spoken document retrieval systems, and audio retrieval systems. We modify and combine audio features known to be effective in distinguishing speech from music, and examine their behavior on mixed audio. Our preliminary experimental results show that we can achieve a classification accuracy of over 80% on such mixed audio. Our study also provides several helpful insights into analyzing mixed audio in the context of real applications.
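One of the classic features for distinguishing speech from music that this line of work builds on is the short-time zero-crossing rate (ZCR): speech alternates between low-ZCR voiced and high-ZCR unvoiced segments, so its ZCR varies more over time than that of most music. A minimal sketch in Python with NumPy, assuming a mono signal array; the function names and frame parameters here are illustrative, not taken from CueVideo:

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of consecutive sample pairs whose sign differs."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

def frame_zcr(signal: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Short-time ZCR over overlapping frames of the signal."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.array([
        zero_crossing_rate(signal[i * hop : i * hop + frame_len])
        for i in range(n_frames)
    ])

# Noise-like (unvoiced) sound has a much higher ZCR than a low-frequency tone,
# and the variance of short-time ZCR is a cheap speech-vs-music cue.
```

On mixed audio such features become less reliable on their own, which is why the paper combines several of them rather than relying on any single discriminator.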
