Optimising The Input Window Alignment in CD-DNN Based Phoneme Recognition for Low Latency Processing

We present a systematic analysis on the performance of a phonetic recogniser when the window of input features is not symmetric with respect to the current frame. The recogniser is based on Context Dependent Deep Neural Networks (CD-DNNs) and Hidden Markov Models (HMMs). The objective is to reduce the latency of the system by reducing the number of future feature frames required to estimate the current output. Our tests performed on the TIMIT database show that the performance does not degrade when the input window is shifted up to 5 frames in the past compared to common practice (no future frame). This corresponds to improving the latency by 50 ms in our settings. Our tests also show that the best results are not obtained with the symmetric window commonly employed, but with an asymmetric window with eight past and two future context frames, although this observation should be confirmed on other data sets. The reduction in latency suggested by our results is critical for specific applications such as real-time lip synchronisation for tele-presence, but may also be beneficial in general applications to improve the lag in human-machine spoken interaction.

[1]  Razvan Pascanu,et al.  Theano: new features and speed improvements , 2012, ArXiv.

[2]  Yajie Miao,et al.  Kaldi+PDNN: Building DNN-based ASR Systems with Kaldi and PDNN , 2014, ArXiv.

[3]  Geoffrey E. Hinton,et al.  Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[4]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[5]  Ngoc Thang Vu,et al.  Generating exact lattices in the WFST framework , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Dong Yu,et al.  Large vocabulary continuous speech recognition with context-dependent DBN-HMMS , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Jürgen Schmidhuber,et al.  Deep learning in neural networks: An overview , 2014, Neural Networks.

[8]  Navdeep Jaitly,et al.  Exploring Deep Learning Methods for Discovering Features in Speech Signals , 2014 .

[9]  Hao Li,et al.  Speaker-independent lips and tongue visualization of vowels , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10]  Sanjeev Khudanpur,et al.  A time delay neural network architecture for efficient modeling of long temporal contexts , 2015, INTERSPEECH.

[11]  Hsiao-Wuen Hon,et al.  Speaker-independent phone recognition using hidden Markov models , 1989, IEEE Trans. Acoust. Speech Signal Process..

[12]  Kaisheng Yao,et al.  Adaptation of context-dependent deep neural networks for automatic speech recognition , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[13]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[14]  Alexander Gruenstein,et al.  Accurate and compact large vocabulary speech recognition on mobile devices , 2013, INTERSPEECH.

[15]  Giampiero Salvi Dynamic behaviour of connectionist speech recognition with strong latency constraints , 2006, Speech Commun..

[16]  Jianhua Tao,et al.  Real-time speech-driven lip synchronization , 2010, 2010 4th International Universal Communication Symposium.

[17]  Tara N. Sainath,et al.  Deep Belief Networks using discriminative features for phone recognition , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  Björn Granström,et al.  SynFace—Speech-Driven Facial Animation for Virtual Speech-Reading Support , 2009, EURASIP J. Audio Speech Music. Process..