Joint Time–Frequency Scattering

In time series classification and regression, signals are typically mapped into an intermediate representation from which models are constructed. Since the underlying task is often insensitive to time shifts, these representations are required to be time-shift invariant. We introduce the joint time–frequency scattering transform, a time-shift invariant representation that characterizes the multiscale energy distribution of a signal in time and frequency. It is computed through wavelet convolutions and modulus non-linearities and may therefore be implemented as a deep convolutional neural network whose filters are not learned but calculated from wavelets. We consider the progression from mel-spectrograms to time scattering and joint time–frequency scattering transforms, illustrating the relationship between increased discriminability and refinements of convolutional network architectures. The suitability of the joint time–frequency scattering transform for time-shift invariant characterization of time series is demonstrated through applications to chirp signals and audio synthesis experiments. The proposed transform also obtains state-of-the-art results on several audio classification tasks, outperforming time scattering transforms and achieving accuracies comparable to those of fully learned networks.
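As a rough illustration of the cascade described above, the following NumPy sketch builds a constant-Q Morlet-like filterbank, computes a wavelet-modulus scalogram, applies a single separable joint wavelet along time and log-frequency, and averages over time for shift invariance. All parameters (`num_filters`, `xi_max`, `q`, the window length `T`) are illustrative choices, not the paper's filterbank design, and a box average stands in for the Gaussian lowpass filter.

```python
# Minimal joint time-frequency scattering sketch (illustrative, not the
# paper's exact filterbank): wavelet convolution -> modulus -> joint 2D
# wavelet -> modulus -> time averaging.
import numpy as np

def morlet_bank(n, num_filters=32, xi_max=0.4, q=8):
    """Analytic Morlet-like filters sampled in the Fourier domain,
    geometrically spaced in center frequency with constant-Q bandwidths."""
    omega = np.fft.fftfreq(n)                      # normalized frequencies
    xis = xi_max * 2.0 ** (-np.arange(num_filters) / q)
    sigmas = xis / q                               # constant-Q bandwidths
    return np.stack([np.exp(-((omega - xi) ** 2) / (2 * s ** 2))
                     for xi, s in zip(xis, sigmas)])

def scalogram(x, bank):
    """First layer: wavelet convolutions followed by modulus,
    |x * psi_lambda|(t), one row per wavelet."""
    X = np.fft.fft(x)
    return np.abs(np.fft.ifft(X[None, :] * bank, axis=1))

def joint_scattering(x, T=2**10):
    """Second layer: one separable 2D wavelet over (log-frequency, time)
    applied to the scalogram, then modulus and time averaging."""
    U1 = scalogram(x, morlet_bank(len(x)))
    # Separable joint filter: one Morlet along time, one along log-frequency.
    psi_t = morlet_bank(U1.shape[1], num_filters=1, xi_max=0.1, q=1)[0]
    psi_f = morlet_bank(U1.shape[0], num_filters=1, xi_max=0.25, q=1)[0]
    U2 = np.abs(np.fft.ifft2(np.fft.fft2(U1) * np.outer(psi_f, psi_t)))
    # Box average over non-overlapping windows stands in for the lowpass
    # phi; output is invariant to time shifts smaller than T.
    n_frames = U2.shape[1] // T
    return U2[:, :n_frames * T].reshape(U2.shape[0], n_frames, T).mean(axis=2)

# Toy usage: an ascending chirp.
t = np.linspace(0, 1, 2**14, endpoint=False)
x = np.cos(2 * np.pi * (500 * t + 2000 * t ** 2))
S = joint_scattering(x)
print(S.shape)  # (num_filters, num_frames)
```

A full implementation would apply an entire bank of joint wavelets at multiple time and frequency rates, including negative frequency rates so that ascending and descending chirps yield distinct coefficients, and would concatenate the averaged first-order coefficients alongside the joint second-order ones.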
