Articulatory and spectrum features integration using generalized distillation framework

It has been shown that combining acoustic and articulatory information can yield significant performance improvements in automatic speech recognition (ASR). In practice, however, articulatory information is not available during recognition, and the usual approach is to estimate it from the acoustic signal. In this paper, we propose a different approach based on the generalized distillation framework, in which acoustic-to-articulatory inversion is not necessary. We train two DNN models: a "teacher", which learns from both acoustic and articulatory features, and a "student", which is trained on acoustic features only but whose training is guided by the teacher. The student reaches a level of performance that regular training cannot achieve, while requiring no articulatory feature inputs at test time. The paper is organized as follows: Section 1 gives the introduction and briefly discusses related work; Section 2 describes the distillation training process; Section 3 describes the ASR system used in this paper; Section 4 presents the experiments; and Section 5 concludes the paper.
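To make the teacher-student objective concrete, the following is a minimal PyTorch sketch of the generalized distillation recipe described above (Lopez-Paz et al., 2015; Hinton et al., 2015). The feature dimensions, network sizes, temperature T, and imitation weight lam are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of generalized distillation with articulatory features
# as privileged information. All sizes and hyperparameters are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

ACOUSTIC_DIM, ARTIC_DIM, NUM_STATES = 40, 12, 1000  # hypothetical sizes

def mlp(in_dim, out_dim, hidden=512):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

# Teacher sees acoustic + articulatory (privileged) features;
# the student sees acoustic features only.
teacher = mlp(ACOUSTIC_DIM + ARTIC_DIM, NUM_STATES)
student = mlp(ACOUSTIC_DIM, NUM_STATES)

T, lam = 2.0, 0.5  # softening temperature and imitation weight (assumed)

def distillation_loss(acoustic, articulatory, hard_labels):
    """Generalized-distillation objective for one minibatch."""
    with torch.no_grad():  # teacher is pre-trained and kept frozen
        soft_targets = F.softmax(
            teacher(torch.cat([acoustic, articulatory], dim=-1)) / T, dim=-1)
    logits = student(acoustic)
    hard = F.cross_entropy(logits, hard_labels)
    # KL divergence between the teacher's softened posteriors and the
    # student's; T**2 rescales the gradients as in Hinton et al. (2015).
    soft = F.kl_div(F.log_softmax(logits / T, dim=-1),
                    soft_targets, reduction='batchmean') * (T ** 2)
    return (1.0 - lam) * hard + lam * soft
```

At recognition time only the student network is evaluated, so articulatory measurements are needed during training alone, which is the point of the framework.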
