Visualizing Phoneme Category Adaptation in Deep Neural Networks

Both human listeners and machines need to adapt their sound categories whenever they encounter a new speaker. This perceptual learning is driven by lexical information. The aim of this paper is two-fold: first, to investigate whether a deep neural network (DNN)-based ASR system can adapt to only a few examples of ambiguous speech, as humans have been found to do; second, to investigate a DNN’s ability to serve as a model of human perceptual learning. Crucially, we do so by looking at intermediate levels of phoneme category adaptation rather than only at the output level: we visualize the activations in the hidden layers of the DNN during perceptual learning. The results show that, similar to humans, DNN systems learn speaker-adapted phone category boundaries from a few labeled examples. The DNN adapts its category boundaries not only by adapting the weights of the output layer, but also by adapting the implicit feature maps computed by the hidden layers. This suggests that human perceptual learning might involve a similar nonlinear distortion of a perceptual space that is intermediate between the acoustic input and the phonological categories. Comparisons between DNNs and humans can thus provide valuable insights into the way humans process speech and can help improve ASR technology.
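The idea of inspecting hidden-layer activations before and after adaptation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the system described in the abstract: the two-dimensional toy "acoustic" features, the single tanh hidden layer, the learning rates, and the helper names `init_params`, `forward`, `train`, and `pca_2d` are all hypothetical. A small network is trained on two well-separated classes, adapted on a handful of ambiguous examples labeled as one class, and the hidden activations of those examples are compared and projected to two dimensions for plotting.

```python
import numpy as np

def init_params(rng, n_in=2, n_hidden=8, n_out=2):
    # Hypothetical toy network: one tanh hidden layer, softmax output.
    return {
        "W1": rng.normal(0.0, 0.5, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(params, X):
    # Returns hidden activations H and class posteriors P.
    H = np.tanh(X @ params["W1"] + params["b1"])
    logits = H @ params["W2"] + params["b2"]
    logits = logits - logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P = P / P.sum(axis=1, keepdims=True)
    return H, P

def train(params, X, y, lr=0.1, steps=500):
    # Plain gradient descent on cross-entropy; returns a new parameter dict.
    params = {k: v.copy() for k, v in params.items()}
    Y = np.eye(2)[y]
    for _ in range(steps):
        H, P = forward(params, X)
        dlogits = (P - Y) / len(X)
        dH = (dlogits @ params["W2"].T) * (1.0 - H ** 2)
        params["W2"] -= lr * (H.T @ dlogits)
        params["b2"] -= lr * dlogits.sum(axis=0)
        params["W1"] -= lr * (X.T @ dH)
        params["b1"] -= lr * dH.sum(axis=0)
    return params

def pca_2d(H):
    # Project activations onto their first two principal components.
    centered = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T

rng = np.random.default_rng(0)

# Assumed toy data: class 0 around (-1, 0), class 1 around (+1, 0).
X0 = rng.normal([-1.0, 0.0], 0.3, (100, 2))
X1 = rng.normal([+1.0, 0.0], 0.3, (100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# A few ambiguous tokens halfway between the classes, labeled as class 1
# (standing in for the lexical context that disambiguates them).
X_amb = rng.normal([0.0, 0.0], 0.1, (10, 2))
y_amb = np.ones(10, dtype=int)

baseline = train(init_params(rng), X, y)
adapted = train(baseline, X_amb, y_amb, lr=0.05, steps=100)

H_before, P_before = forward(baseline, X_amb)
H_after, P_after = forward(adapted, X_amb)

p_before = P_before[:, 1].mean()
p_after = P_after[:, 1].mean()
hidden_shift = np.abs(H_after - H_before).max()

# 2-D coordinates of hidden activations, before and after, ready to scatter-plot.
coords = pca_2d(np.vstack([H_before, H_after]))

print("mean P(class 1 | ambiguous) before:", p_before)
print("mean P(class 1 | ambiguous) after: ", p_after)
print("largest change in a hidden activation:", hidden_shift)
```

In this sketch the adaptation step updates all layers, so a nonzero `hidden_shift` indicates that the hidden representation moved and not only the output weights. Plotting the first ten rows of `coords` against the last ten gives a rough picture of that movement. Adapting on ambiguous tokens alone is a simplification chosen for brevity.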
