Far-Field ASR Using Low-Rank and Sparse Soft Targets from Parallel Data

Far-field automatic speech recognition (ASR) of conversational speech is a challenging task, largely because of the poor quality of the alignments available for training DNN acoustic models. A common way to alleviate this problem is to use clean alignments obtained from close-talk speech recorded in parallel. In this work, we advance the parallel-data approach by obtaining enhanced low-rank and sparse soft targets from a close-talk ASR system and using them to train more accurate far-field acoustic models. Specifically, we (i) exploit eigenposteriors and compressive sensing dictionaries to learn low-dimensional senone subspaces in the DNN posterior space, and (ii) enhance the close-talk DNN posteriors to obtain high-quality soft targets for training far-field DNN acoustic models. We show that the enhanced soft targets encode the structural and temporal interrelationships among senone classes, which are easily accessible in the DNN posterior space of close-talk speech but not in its noisy far-field counterpart. We exploit these enhanced soft targets to improve the mapping of far-field acoustics to close-talk senone classes. Experiments are performed on the AMI meeting corpus, where our approach improves DNN-based acoustic modeling by a 4.4% absolute (~8% relative) reduction in WER compared to a system that does not use parallel data. Finally, the approach is also validated on state-of-the-art recurrent and time-delay neural network architectures.
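The two enhancement steps above can be sketched in code. This is a minimal, hypothetical illustration only: the posterior matrix is synthetic, the dictionary is stood in for by the PCA basis (the paper instead learns it with online dictionary learning), and all dimensions and regularization values are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA, sparse_encode

# Hypothetical close-talk DNN posteriors for frames of one senone class:
# each row is a frame, each column a senone posterior probability.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(50), size=200)  # 200 frames, 50 senones

# (i) Low-rank projection ("eigenposteriors"): project posteriors onto the
# top principal components of the class-specific subspace and reconstruct.
pca = PCA(n_components=5)
low_rank = pca.inverse_transform(pca.fit_transform(posteriors))

# (ii) Sparse reconstruction over a dictionary (the compressive-sensing
# view); here the PCA basis merely illustrates the dictionary's role.
codes = sparse_encode(posteriors, pca.components_,
                      algorithm='lasso_lars', alpha=0.01)
sparse_recon = codes @ pca.components_

# Enhanced soft targets: clip negatives introduced by the reconstruction
# and renormalize each frame back to a probability distribution.
enhanced = np.clip(low_rank, 0.0, None)
enhanced /= enhanced.sum(axis=1, keepdims=True)
```

In the paper's pipeline, such enhanced distributions would replace the hard alignments as soft targets when training the far-field DNN.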
