Pooling Robust Shift-Invariant Sparse Representations of Acoustic Signals

In recent years, the design of coding and pooling structures in layered networks has proven useful for learning high-level feature representations of visual data. Such learning structures, however, have not been studied extensively for audio signals. In this paper, we investigate different pooling strategies built on a sparse coding scheme and propose a temporal pyramid pooling method that extracts discriminative, shift-invariant feature representations. We demonstrate that the new feature representation outperforms traditional features on the acoustic event classification task.
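To make the idea of temporal pyramid pooling concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the pooling operator or the pyramid layout, so the choices here are assumptions: max pooling over absolute coefficient values, and pyramid levels that split the signal into 1, 2, and 4 equal temporal segments. The function name `temporal_pyramid_pool` and the input layout (a list of frames, each a list of sparse coefficients, one per dictionary atom) are likewise hypothetical.

```python
def temporal_pyramid_pool(codes, levels=(1, 2, 4)):
    """Pool frame-level sparse codes into one fixed-length feature vector.

    codes  : list of T frames; each frame is a list of K coefficients
             (one per dictionary atom). Layout is an assumption.
    levels : number of equal temporal segments at each pyramid level
             (assumed values, not taken from the paper).
    Returns a list of length K * sum(levels).
    """
    if not codes:
        raise ValueError("codes must contain at least one frame")
    n_frames = len(codes)
    n_atoms = len(codes[0])
    features = []
    for n_segments in levels:
        for s in range(n_segments):
            # Integer boundaries of segment s at this pyramid level.
            start = s * n_frames // n_segments
            end = (s + 1) * n_frames // n_segments
            end = max(end, start + 1)  # keep every segment non-empty
            segment = codes[start:end]
            # Max pooling over time: keep the strongest activation of
            # each atom, wherever it occurs inside the segment.
            for k in range(n_atoms):
                features.append(max(abs(frame[k]) for frame in segment))
    return features


# Toy input: 4 frames, 2 dictionary atoms.
codes = [[0.0, 0.5], [-0.9, 0.0], [0.0, 0.0], [0.2, -0.3]]
pooled = temporal_pyramid_pool(codes, levels=(1, 2))

# Swapping the first two frames moves the activations in time but keeps
# them inside the same segments, so the pooled vector is unchanged.
shifted = [codes[1], codes[0], codes[2], codes[3]]
pooled_shifted = temporal_pyramid_pool(shifted, levels=(1, 2))
```

In this sketch the coarsest level ignores timing entirely, while finer levels retain a rough temporal ordering. An activation that moves within a segment leaves the pooled vector unchanged, which is one way to obtain the shift invariance the abstract refers to.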
