Auditory Sketches: Sparse Representations of Sounds Based on Perceptual Models

An important question for both signal processing and auditory science is to understand which features of a sound carry the most important information for the listener. Here we approach the issue by introducing the idea of "auditory sketches": sparse representations of sounds, severely impoverished compared to the original, which nevertheless afford good performance on a given perceptual task. Starting from biologically grounded representations (auditory models), a sketch is obtained by reconstructing the sound from a highly under-sampled selection of elementary atoms. The sketch is then evaluated in a psychophysical experiment with human listeners, and the process can be repeated iteratively. As a proof of concept, we present data for an emotion recognition task with short non-verbal sounds. We investigate (1) the type of auditory representation that can be used for sketches, (2) the selection procedure used to sparsify such representations, (3) the smallest number of atoms that can be kept, and (4) the robustness to noise. Results indicate that it is possible to produce recognizable sketches with a very small number of atoms per second. Furthermore, at least in our experimental setup, a simple and fast under-sampling method based on selecting local maxima of the representation seems to perform as well as or better than a more traditional algorithm aimed at minimizing the reconstruction error. Thus, auditory sketches may be a useful tool for choosing sparse dictionaries, and also for identifying the minimal set of features required in a specific perceptual task.
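To make the local-maxima under-sampling concrete, the following is a minimal illustrative sketch of the idea, not the paper's actual pipeline: the paper operates on auditory-model representations, whereas this toy version uses a plain STFT as the time-frequency representation, keeps only the largest local maxima as "atoms", and resynthesizes from the retained coefficients. The function name `auditory_sketch` and all parameter choices are hypothetical.

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import stft, istft

def auditory_sketch(x, fs, n_atoms=32, nperseg=256):
    """Sparsify a sound by keeping only the n_atoms largest local maxima
    of its time-frequency magnitude, then resynthesize the waveform."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    mag = np.abs(Z)
    # A bin is a local maximum if it equals the max of its 3x3 neighborhood.
    peaks = (mag == maximum_filter(mag, size=3)) & (mag > 0)
    # Among the local maxima, retain at most n_atoms with the largest magnitude.
    order = np.argsort(mag[peaks])[::-1][:n_atoms]
    rows, cols = np.nonzero(peaks)
    mask = np.zeros_like(mag, dtype=bool)
    mask[rows[order], cols[order]] = True
    # Reconstruct from the retained atoms only (all other bins set to zero).
    _, x_sketch = istft(Z * mask, fs=fs, nperseg=nperseg)
    return x_sketch, int(mask.sum())

# Example: sketch a harmonic tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
y, kept = auditory_sketch(x, fs, n_atoms=16)
```

A periodic sound concentrates its energy at a few time-frequency peaks, so even a handful of retained atoms preserves the harmonic structure; the abstract's point is that such drastically impoverished versions can remain perceptually recognizable.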