Untangling in Invariant Speech Recognition

Encouraged by the success of deep convolutional neural networks on a variety of visual tasks, much theoretical and experimental work has been aimed at understanding and interpreting how vision networks operate. At the same time, deep neural networks have also achieved impressive performance in audio processing applications, both as sub-components of larger systems and as complete end-to-end systems by themselves. Despite their empirical successes, comparatively little is understood about how these audio models accomplish these tasks.In this work, we employ a recently developed statistical mechanical theory that connects geometric properties of network representations and the separability of classes to probe how information is untangled within neural networks trained to recognize speech. We observe that speaker-specific nuisance variations are discarded by the network's hierarchy, whereas task-relevant properties such as words and phonemes are untangled in later layers. Higher level concepts such as parts-of-speech and context dependence also emerge in the later layers of the network. Finally, we find that the deep representations carry out significant temporal untangling by efficiently extracting task-relevant features at each time step of the computation. Taken together, these findings shed light on how deep auditory models process their time dependent input signals to carry out invariant speech recognition, and show how different concepts emerge through the layers of the network.

[1]  Surya Ganguli,et al.  A theory of multineuronal dimensionality, dynamics and measurement , 2017, bioRxiv.

[2]  Uri Cohen,et al.  Separability and geometry of object manifolds in deep neural networks , 2020, Nature Communications.

[3]  Janet M. Baker,et al.  The Design for the Wall Street Journal-based CSR Corpus , 1992, HLT.

[4]  Hung-yi Lee,et al.  Gate Activation Signal Analysis for Gated Recurrent Neural Networks and its Correlation with Phoneme Boundaries , 2017, INTERSPEECH.

[5]  Timo Baumann,et al.  Mining the Spoken Wikipedia for Speech Data and Beyond , 2016, LREC.

[6]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Tasha Nagamine,et al.  Exploring how deep neural networks form phonemic categories , 2015, INTERSPEECH.

[8]  Surya Ganguli,et al.  On simplicity and complexity in the brave new world of large-scale neuroscience , 2015, Current Opinion in Neurobiology.

[9]  Hod Lipson,et al.  Convergent Learning: Do different neural networks learn the same representations? , 2015, FE@NIPS.

[10]  Tasha Nagamine,et al.  On the Role of Nonlinear Transformations in Deep Neural Network Acoustic Models , 2016, INTERSPEECH.

[11]  Xiao Liu,et al.  Deep Speaker: an End-to-End Neural Speaker Embedding System , 2017, ArXiv.

[12]  Sebastian Stober,et al.  Neuron Activation Profiles for Interpreting Convolutional Speech Recognition Models , 2018 .

[13]  Aren Jansen,et al.  Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Surya Ganguli,et al.  Exponential expressivity in deep neural networks through transient chaos , 2016, NIPS.

[15]  S. Ganguli,et al.  Statistical mechanics of complex neural systems and high dimensional data , 2013, 1301.7115.

[16]  Yonatan Belinkov,et al.  On internal language representations in deep learning: an analysis of machine translation and speech recognition , 2018 .

[17]  Tara N. Sainath,et al.  Acoustic modelling with CD-CTC-SMBR LSTM RNNS , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[18]  Thomas M. Cover,et al.  Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition , 1965, IEEE Trans. Electron. Comput..

[19]  Daniel L. K. Yamins,et al.  A Task-Optimized Neural Network Replicates Human Auditory Behavior, Predicts Brain Responses, and Reveals a Cortical Processing Hierarchy , 2018, Neuron.

[20]  Haim Sompolinsky,et al.  Separability and geometry of object manifolds in deep neural networks , 2019, Nature Communications.

[21]  Nikolaus Kriegeskorte,et al.  Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation , 2014, PLoS Comput. Biol..

[22]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[23]  Jascha Sohl-Dickstein,et al.  SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability , 2017, NIPS.

[24]  Daniel D. Lee,et al.  Classification and Geometry of General Perceptual Manifolds , 2017, Physical Review X.

[25]  Sompolinsky,et al.  Statistical mechanics of learning from examples. , 1992, Physical review. A, Atomic, molecular, and optical physics.

[26]  David D. Cox,et al.  Untangling invariant object recognition , 2007, Trends in Cognitive Sciences.

[27]  Shuai Wang,et al.  What Does the Speaker Embedding Encode? , 2017, INTERSPEECH.

[28]  Eero P. Simoncelli,et al.  Geodesics of learned representations , 2015, ICLR.

[29]  Yonatan Belinkov,et al.  Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems , 2017, NIPS.

[30]  Jakob H. Macke,et al.  Analyzing biological and artificial neural networks: challenges with opportunities for synergy? , 2018, Current Opinion in Neurobiology.

[31]  Benjamin Lecouteux,et al.  Analyzing Learned Representations of a Deep ASR Performance Prediction Model , 2018, BlackboxNLP@EMNLP.

[32]  Guillermo Sapiro,et al.  Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? , 2015, IEEE Transactions on Signal Processing.

[33]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[34]  Haim Sompolinsky,et al.  Optimal Degrees of Synaptic Connectivity , 2017, Neuron.

[35]  Haim Sompolinsky,et al.  Linear readout of object manifolds. , 2015, Physical review. E.

[36]  C. Atencio,et al.  Hierarchical representations in the auditory cortex , 2011, Current Opinion in Neurobiology.

[37]  Eero P. Simoncelli,et al.  Perceptual straightening of natural videos , 2019, Nature Neuroscience.

[38]  E. Gardner The space of interactions in neural network models , 1988 .