Sound Retrieval and Ranking Using Sparse Auditory Representations

To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a sound-ranking framework to quantitatively evaluate such representations in a large-scale task. We adapted a machine-vision method, the passive-aggressive model for image retrieval (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large query-term space. Using this approach, we compare different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. We tested auditory models that use an adaptive pole-zero filter cascade (PZFC) auditory filter bank and sparse-code feature extraction from stabilized auditory images with multiple vector quantizers. In addition to the auditory image models, we compare a family of more conventional mel-frequency cepstral coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs: when ranking thousands of sound files against a query vocabulary of thousands of words, the best precision at top-1 was 73% and the average precision was 35%, an 18% improvement over the best competing MFCC front end.
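
PAMIR scores a query-sound pair with a bilinear model S(q, a) = q^T W a, where q is a bag-of-words query vector, a is the sparse acoustic feature vector, and W is the learned linear mapping; W is trained online with passive-aggressive (PA-I) updates on triplets of a query, a relevant sound, and an irrelevant sound. Below is a minimal NumPy sketch of one such update, assuming a unit margin and an illustrative aggressiveness parameter C; the function and variable names are ours for illustration, not the exact settings or code used in the paper.

    import numpy as np

    def pamir_update(W, q, a_pos, a_neg, C=0.1):
        """One passive-aggressive (PA-I) ranking update.

        W      -- (n_terms, n_features) linear mapping from features to terms
        q      -- (n_terms,) bag-of-words query vector
        a_pos  -- (n_features,) sparse features of a sound relevant to q
        a_neg  -- (n_features,) sparse features of an irrelevant sound
        C      -- aggressiveness: caps the step size of each update
        """
        # Bilinear ranking score S(q, a) = q^T W a for each candidate sound.
        margin = q @ W @ a_pos - q @ W @ a_neg
        loss = max(0.0, 1.0 - margin)  # hinge loss with unit margin
        if loss > 0.0:
            # Smallest change to W (in Frobenius norm) that repairs the
            # margin violation, capped at C (PA-I step size).
            V = np.outer(q, a_pos - a_neg)
            tau = min(C, loss / ((V * V).sum() + 1e-12))
            W += tau * V
        return W

At test time, sounds are ranked for a query q by sorting their scores q^T W a; because both q and the feature vectors a are sparse, each score reduces to a few multiply-adds, which is what makes the very large feature and query-term spaces tractable.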
