Learning spectro-temporal representations of complex sounds with parameterized neural networks

Deep learning models have become potential candidates for auditory neuroscience research, thanks to their recent successes in a variety of auditory tasks, yet these models often lack the interpretability needed to fully understand the exact computations they perform. Here, we proposed a parameterized neural network layer, which computes specific spectro-temporal modulations based on Gabor filters [learnable spectro-temporal filters (STRFs)] and is fully interpretable. We evaluated this layer on speech activity detection, speaker verification, urban sound classification, and zebra finch call type classification. We found that models based on learnable STRFs are on par with the state of the art on all tasks and obtain the best performance for speech activity detection. Because each filter in this layer remains a Gabor filter, we could use quantitative measures to describe the distribution of the learned spectro-temporal modulations. Filters adapted to each task and focused mostly on low temporal and spectral modulations. The analyses show that the filters learned on human speech have spectro-temporal parameters similar to those measured directly in the human auditory cortex. Finally, we observed that the tasks were organized in a meaningful way: the human vocalization tasks were close to each other, and the bird vocalization task was far from both the human vocalization and urban sound tasks.
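
To make the idea of a Gabor-based spectro-temporal filter concrete, the following is a minimal NumPy sketch of a single 2-D Gabor kernel defined over frequency channels and time frames. It is an illustration under stated assumptions, not the parameterization used in this work: the function name `gabor_strf`, its argument names, the units (cycles per frame and cycles per channel), and the 9 x 9 kernel size are all hypothetical. In a learnable layer, the modulation frequencies and envelope widths would be the trainable parameters and the kernel would be convolved with a spectrogram.

```python
import numpy as np


def gabor_strf(temporal_mod, spectral_mod, sigma_t, sigma_f, n_t=9, n_f=9):
    """Build one complex 2-D Gabor kernel of shape (n_f, n_t).

    All names and units here are illustrative assumptions:
    temporal_mod is in cycles per time frame, spectral_mod in cycles
    per frequency channel, and sigma_t / sigma_f are the Gaussian
    envelope widths in frames and channels.
    """
    # Coordinates centred on zero along time and frequency.
    t = np.arange(n_t) - (n_t - 1) / 2
    f = np.arange(n_f) - (n_f - 1) / 2
    ff, tt = np.meshgrid(f, t, indexing="ij")

    # Gaussian envelope localises the filter in time and frequency.
    envelope = np.exp(-0.5 * ((tt / sigma_t) ** 2 + (ff / sigma_f) ** 2))

    # Complex sinusoidal carrier selects one spectro-temporal modulation.
    carrier = np.exp(2j * np.pi * (temporal_mod * tt + spectral_mod * ff))

    return envelope * carrier


# Example values are arbitrary and chosen only for illustration.
kernel = gabor_strf(temporal_mod=0.1, spectral_mod=0.05, sigma_t=3.0, sigma_f=3.0)
```

Because the kernel is fully determined by four numbers, each learned filter can be read off directly as a temporal modulation, a spectral modulation, and two envelope widths, which is the sense in which such a layer is interpretable.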
