Deep Scattering Spectrum

A scattering transform defines a locally translation-invariant representation that is stable to time-warping deformations. It extends MFCC representations by computing modulation spectrum coefficients of multiple orders through cascades of wavelet convolutions and modulus operators. Second-order scattering coefficients characterize transient phenomena such as attacks and amplitude modulation. A frequency-transposition-invariant representation is obtained by applying a scattering transform along log-frequency. State-of-the-art classification results are obtained for musical genre and phone classification on the GTZAN and TIMIT databases, respectively.
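The cascade described above can be sketched in code. The following is a minimal NumPy illustration, not the authors' implementation: the Gabor filter shapes, Q factors, center-frequency grids, and the low-pass scale `sigma` are all assumptions chosen only to show the structure (band-pass filtering, complex modulus, low-pass averaging, then a second wavelet-modulus layer on each envelope).

```python
# Minimal sketch of a first- and second-order scattering cascade (illustrative
# choices throughout; not the paper's exact filters or parameters).
import numpy as np

def gabor_filterbank(n, q, centers):
    """Band-pass Gabor filters defined directly in the frequency domain."""
    freqs = np.fft.fftfreq(n)
    return np.array([np.exp(-0.5 * ((freqs - xi) / (xi / q)) ** 2) for xi in centers])

def lowpass(n, sigma):
    """Gaussian low-pass filter phi; sigma sets the invariance scale."""
    freqs = np.fft.fftfreq(n)
    return np.exp(-0.5 * (freqs / sigma) ** 2)

def scattering(x, q1=8, q2=1, sigma=0.005):
    """First- and second-order scattering coefficients of a 1-D signal x."""
    n = len(x)
    X = np.fft.fft(x)
    phi = lowpass(n, sigma)
    xi1 = np.geomspace(0.01, 0.45, 32)    # first-order center frequencies (assumed grid)
    xi2 = np.geomspace(0.002, 0.1, 12)    # second-order (modulation) frequencies
    psi1 = gabor_filterbank(n, q1, xi1)
    psi2 = gabor_filterbank(n, q2, xi2)

    S1, S2 = [], []
    for f1, P1 in zip(xi1, psi1):
        u1 = np.abs(np.fft.ifft(X * P1))                         # U1 = |x * psi_{lambda1}|
        S1.append(np.real(np.fft.ifft(np.fft.fft(u1) * phi)))    # S1 = U1 * phi
        U1 = np.fft.fft(u1)
        for f2, P2 in zip(xi2, psi2):
            if f2 >= f1:                                         # keep only decreasing paths
                continue
            u2 = np.abs(np.fft.ifft(U1 * P2))                    # U2 = ||x * psi1| * psi2|
            S2.append(np.real(np.fft.ifft(np.fft.fft(u2) * phi)))  # S2 = U2 * phi
    return np.array(S1), np.array(S2)

# Usage: scattering coefficients of an amplitude-modulated tone.
t = np.arange(2 ** 13) / 16000.0
x = (1 + 0.5 * np.cos(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 440 * t)
S1, S2 = scattering(x)
print(S1.shape, S2.shape)
```

In this sketch the first-order outputs play the role of averaged mel-like coefficients, while the second layer captures amplitude-modulation structure of each envelope; a full implementation would also subsample the averaged coefficients and normalize them, which is omitted here for brevity.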
