Semantic-based Audio Recognition and Retrieval

This study considers the problem of attaching meaning to non-speech sound. The goal is twofold: to automatically annotate a sound with a string of semantically appropriate words, and to retrieve the sounds most relevant to a given textual query. This is achieved by constructing acoustic and semantic spaces from a database of paired sounds and descriptions, using statistical models to learn similarity within each space, and linking the two spaces so that retrieval can proceed in either direction. A key aspect is the effective prediction of novel events through generalisation from known examples. The motivation and implementation of the system are described, drawing on techniques and representations including Mel-frequency cepstral coefficients, Gaussian mixture models, hierarchical clustering and latent semantic analysis. System results are evaluated with both automatic classification measures and human judgements, demonstrating that this is an effective method for the annotation and retrieval of general sound.
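The pipeline outlined above can be illustrated with a minimal sketch. The code below is not the paper's implementation: it assumes librosa for MFCC extraction and scikit-learn for the Gaussian mixture and latent semantic analysis (truncated SVD) steps, and the file names, captions, model sizes, and retrieval rule are placeholder assumptions chosen only to show how an acoustic space and a semantic space might be built and linked.

```python
# Minimal sketch of the described approach (illustrative, not the authors' code):
# per-sound MFCC features modelled by a Gaussian mixture, LSA over the textual
# descriptions, and a simple link that returns the caption and semantic vector
# of the best-scoring acoustic model for a query sound.

import numpy as np
import librosa                                    # assumed available for MFCC extraction
from sklearn.mixture import GaussianMixture
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical database of (sound file, textual description) pairs.
sounds = ["dog_bark.wav", "door_slam.wav", "rainfall.wav"]
captions = ["dog barking loudly", "heavy door slamming shut", "steady rain falling"]

# --- Acoustic space: one MFCC-based Gaussian mixture model per sound --------
def acoustic_model(path, n_components=4):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T    # frames x coefficients
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(mfcc)
    return gmm

gmms = [acoustic_model(p) for p in sounds]

# --- Semantic space: latent semantic analysis over the term-document matrix --
vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(captions)                # documents x terms
lsa = TruncatedSVD(n_components=2)
doc_vectors = lsa.fit_transform(term_doc)                    # low-rank semantic vectors

# --- Linking the spaces: annotate a query sound via its best acoustic model --
def annotate(query_path):
    y, sr = librosa.load(query_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
    scores = [g.score(mfcc) for g in gmms]                   # mean log-likelihood per model
    best = int(np.argmax(scores))
    return captions[best], doc_vectors[best]                 # words + semantic position
```

The reverse direction (text query to sound) would project the query into the LSA space, find the nearest stored description, and return the sounds attached to it; that step is omitted here for brevity.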
