Robust Features in Deep-Learning-Based Speech Recognition

Recent progress in deep learning has revolutionized speech recognition research, with Deep Neural Networks (DNNs) becoming the new state of the art for acoustic modeling. DNNs deliver significantly lower speech recognition error rates than the previously dominant Gaussian Mixture Models (GMMs). Unfortunately, DNNs are data-sensitive, and unseen data conditions can deteriorate their performance. Acoustic distortions such as noise, reverberation, and channel differences add variation to the speech signal, which in turn degrades DNN acoustic model performance. A straightforward solution to this issue is training the DNN models on data containing these types of variation, which typically yields impressive performance. However, anticipating such variation is not always possible; in these cases, DNN recognition performance can deteriorate quite sharply. To avoid subjecting acoustic models to such variation, robust features have traditionally been used to create an invariant representation of the acoustic space. Most commonly, robust feature-extraction strategies have explored three principal areas: (a) enhancing the speech signal, with a goal of improving the perceptual quality of speech; (b) reducing the distortion footprint, with signal-theoretic techniques used to learn the distortion characteristics and subsequently filter them out of the speech signal; and finally (c) leveraging knowledge from auditory neuroscience and psychoacoustics, by using robust features inspired by auditory perception.
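To make strategy (b) concrete, a classic distortion-footprint technique is magnitude spectral subtraction: estimate the noise magnitude spectrum, subtract it from each noisy frame's magnitude spectrum, and reuse the noisy phase. The sketch below is a minimal, illustrative implementation, not the method of any specific cited system; the oracle noise estimate and the flooring constant are assumptions chosen for the demo.

```python
import numpy as np

def spectral_subtraction(noisy_spec, noise_mag_est, floor=0.02):
    """Subtract an estimated noise magnitude spectrum from a noisy frame's
    magnitude spectrum, keeping the noisy phase. The floor prevents negative
    (non-physical) magnitudes and limits "musical noise" artifacts."""
    mag = np.abs(noisy_spec)
    phase = np.angle(noisy_spec)
    cleaned_mag = np.maximum(mag - noise_mag_est, floor * mag)
    return cleaned_mag * np.exp(1j * phase)

# Toy single-frame example: a pure tone corrupted by white noise.
rng = np.random.default_rng(0)
frame_len = 256
t = np.arange(frame_len)
clean = np.sin(2 * np.pi * 8 * t / frame_len)
noise = 0.3 * rng.standard_normal(frame_len)
noisy = clean + noise

noisy_spec = np.fft.rfft(noisy)
# Oracle noise estimate for the demo; a real system would estimate this
# from non-speech frames (e.g., via a voice activity detector).
noise_mag_est = np.abs(np.fft.rfft(noise))

enhanced = np.fft.irfft(spectral_subtraction(noisy_spec, noise_mag_est),
                        n=frame_len)

mse_before = np.mean((noisy - clean) ** 2)
mse_after = np.mean((enhanced - clean) ** 2)
```

In practice the subtraction is applied frame by frame over an STFT, and the residual "musical noise" it introduces is one motivation for the perceptually weighted and harmonic-aware variants of the technique.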

[1]  Horacio Franco,et al.  Coping with Unseen Data Conditions: Investigating Neural Net Architectures, Robust Features, and Information Fusion for Robust Speech Recognition , 2016, INTERSPEECH.

[2]  Nelson Morgan,et al.  Robust CNN-based speech recognition with Gabor filter kernels , 2014, INTERSPEECH.

[3]  Geoffrey E. Hinton,et al.  Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[4]  Tara N. Sainath,et al.  Learning filter banks within a deep neural network framework , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[5]  N. Viemeister Temporal modulation transfer functions based upon modulation thresholds. , 1979, The Journal of the Acoustical Society of America.

[6]  Martin Karafiát,et al.  Further investigation into multilingual training and adaptation of stacked bottle-neck neural network structure , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[7]  Heinz J. Giegerich,et al.  English Phonology: An Introduction , 1992 .

[8]  Hanseok Ko,et al.  Spectral Subtraction Using Spectral Harmonics for Robust Speech Recognition in Car Environments , 2003, International Conference on Computational Science.

[9]  Hynek Hermansky,et al.  Temporal envelope compensation for robust phoneme recognition using modulation spectrum. , 2010, The Journal of the Acoustical Society of America.

[10]  Ron J. Weiss,et al.  Speech acoustic modeling from raw multichannel waveforms , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[12]  Brian Kingsbury,et al.  Improvements to the IBM speech activity detection system for the DARPA RATS program , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Steven Greenberg,et al.  Robust speech recognition using the modulation spectrogram , 1998, Speech Commun..

[14]  C E Schreiner,et al.  Neural processing of amplitude-modulated sounds. , 2004, Physiological reviews.

[15]  Hynek Hermansky,et al.  Temporal patterns (TRAPs) in ASR of noisy speech , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[16]  Lukás Burget,et al.  Three ways to adapt a CTS recognizer to unseen reverberated speech in BUT system for the ASpIRE challenge , 2015, INTERSPEECH.

[17]  Sanjeev Khudanpur,et al.  JHU ASpIRE system: Robust LVCSR with TDNNS, iVector adaptation and RNN-LMS , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[18]  Oded Ghitza,et al.  Auditory nerve representation as a front-end for speech recognition in a noisy environment , 1986 .

[19]  Jonathan Le Roux,et al.  The MERL/SRI system for the 3RD CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[20]  H. Teager Some observations on oral air flow during phonation , 1980 .

[21]  Hanseok Ko,et al.  A novel spectral subtraction scheme for robust speech recognition: spectral subtraction using spectral harmonics of speech , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[22]  Sanjeev Khudanpur,et al.  A time delay neural network architecture for efficient modeling of long temporal contexts , 2015, INTERSPEECH.

[23]  Wen Wang,et al.  Toward human-assisted lexical unit discovery without text resources , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[24]  Petros Maragos,et al.  Time-frequency distributions for automatic speech recognition , 2001, IEEE Trans. Speech Audio Process..

[25]  Edward Jones,et al.  Combined speech enhancement and auditory modelling for robust distributed speech recognition , 2008, Speech Commun..

[26]  S. Seneff A joint synchrony/mean-rate model of auditory speech processing , 1990 .

[27]  Nathalie Virag,et al.  Single channel speech enhancement based on masking properties of the human auditory system , 1999, IEEE Trans. Speech Audio Process..

[28]  Arindam Mandal,et al.  Normalized amplitude modulation features for large vocabulary noise-robust speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29]  Daniel P. W. Ellis,et al.  Frequency-domain linear prediction for temporal features , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[30]  Mark J. F. Gales,et al.  Mean and variance adaptation within the MLLR framework , 1996, Comput. Speech Lang..

[31]  Richard M. Stern,et al.  Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[32]  B. Moore An Introduction to the Psychology of Hearing , 1977 .

[33]  Aaron E. Rosenberg,et al.  Speaker-independent recognition of isolated words using clustering techniques , 1979 .

[34]  Jan Cernocký,et al.  Improved feature processing for deep neural networks , 2013, INTERSPEECH.

[35]  Brian Kingsbury,et al.  New types of deep neural network learning for speech recognition and related applications: an overview , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[36]  Gerald Penn,et al.  Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[37]  O Ghitza,et al.  On the upper cutoff frequency of the auditory critical-band envelope detectors in the context of speech perception. , 2001, The Journal of the Acoustical Society of America.

[38]  Richard M. Stern,et al.  Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[39]  Yun Lei,et al.  All for one: feature combination for highly channel-degraded speech activity detection , 2013, INTERSPEECH.

[40]  Tara N. Sainath,et al.  Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization , 2012, INTERSPEECH.

[41]  DeLiang Wang,et al.  A computational auditory scene analysis system for speech segregation and robust speech recognition , 2010, Comput. Speech Lang..

[42]  Phil D. Green,et al.  Robust automatic speech recognition with missing and unreliable acoustic data , 2001, Speech Commun..

[43]  Yuuki Tachioka,et al.  The MERL/MELCO/TUM system for the REVERB Challenge using Deep Recurrent Neural Network Feature Enhancement , 2014, ICASSP 2014.

[44]  John Makhoul,et al.  LPCW: An LPC vocoder with linear predictive spectral warping , 1976, ICASSP.

[45]  Sven Nordholm,et al.  Spectral subtraction using reduced delay convolution and adaptive averaging , 2001, IEEE Trans. Speech Audio Process..

[46]  Hynek Hermansky,et al.  Robust speech recognition in unknown reverberant and noisy conditions , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[47]  George Saon,et al.  Speaker adaptation of neural network acoustic models using i-vectors , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[48]  Tara N. Sainath,et al.  Deep convolutional neural networks for LVCSR , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[49]  Shrikanth S. Narayanan,et al.  Comparing time-frequency representations for directional derivative features , 2014, INTERSPEECH.

[50]  Sree Hari Krishnan Parthasarathi,et al.  Robust i-vector based adaptation of DNN acoustic model for speech recognition , 2015, INTERSPEECH.

[51]  DeLiang Wang,et al.  On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis , 2005, Speech Separation by Humans and Machines.

[52]  Tara N. Sainath,et al.  Learning the speech front-end with raw waveform CLDNNs , 2015, INTERSPEECH.

[53]  W A Yost,et al.  Temporal changes in a complex spectral profile. , 1987, The Journal of the Acoustical Society of America.

[54]  Yun Lei,et al.  Evaluating robust features on deep neural networks for speech recognition in noisy and channel mismatched conditions , 2014, INTERSPEECH.

[55]  Yongqiang Wang,et al.  An investigation of deep neural networks for noise robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[56]  John H. L. Hansen,et al.  Fusion Strategies for Robust Speech Recognition and Keyword Spotting for Channel- and Noise-Degraded Speech , 2016, INTERSPEECH.

[57]  Dong Yu,et al.  Exploring convolutional neural network structures and optimization techniques for speech recognition , 2013, INTERSPEECH.

[58]  Benjamin Lindner,et al.  Spontaneous voltage oscillations and response dynamics of a Hodgkin-Huxley type model of sensory hair cells , 2011, Journal of mathematical neuroscience.

[59]  Marc René Schädler,et al.  Comparing Different Flavors of Spectro-Temporal Features for ASR , 2011, INTERSPEECH.

[60]  Petros Maragos,et al.  Energy separation in signal modulations with application to speech analysis , 1993, IEEE Trans. Signal Process..

[61]  Sree Hari Krishnan Parthasarathi,et al.  fMLLR based feature-space speaker adaptation of DNN acoustic models , 2015, INTERSPEECH.

[62]  G. Kramer Auditory Scene Analysis: The Perceptual Organization of Sound by Albert Bregman (review) , 2016 .

[63]  Yoshua Bengio,et al.  Deep Learning of Representations for Unsupervised and Transfer Learning , 2011, ICML Unsupervised and Transfer Learning.

[64]  Jinyu Li,et al.  Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks. , 2013, ICLR 2013.

[65]  R. Patterson,et al.  Complex Sounds and Auditory Images , 1992 .

[66]  Mary Harper The Automatic Speech recogition In Reverberant Environments (ASpIRE) challenge , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[67]  Frédéric E. Theunissen,et al.  The Modulation Transfer Function for Speech Intelligibility , 2009, PLoS Comput. Biol..

[68]  Wen Wang,et al.  Improving robustness against reverberation for automatic speech recognition , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[69]  F. Itakura,et al.  A statistical method for estimation of speech spectral density and formant frequencies , 1970 .

[70]  Dong Yu,et al.  Conversational Speech Transcription Using Context-Dependent Deep Neural Networks , 2012, ICML.

[71]  Brian Kingsbury,et al.  Robust speech recognition in Noisy Environments: The 2001 IBM spine evaluation system , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[72]  Martin Graciarena,et al.  The SRI System for the NIST OpenSAD 2015 Speech Activity Detection Evaluation , 2016, INTERSPEECH.

[73]  Jean-Luc Gauvain,et al.  Minimum word error training of RNN-based voice activity detection , 2015, INTERSPEECH.

[74]  Julien van Hout Low Complexity Spectral Imputation for Noise Robust Speech Recognition , 2012 .

[75]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[76]  Lawrence R. Rabiner,et al.  Automatic Speech Recognition - A Brief History of the Technology Development , 2004 .

[77]  Horacio Franco,et al.  Time-frequency convolutional networks for robust speech recognition , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[78]  Misha Pavel,et al.  On the importance of various modulation frequencies for speech recognition , 1997, EUROSPEECH.

[79]  S. Boll,et al.  Suppression of acoustic noise in speech using spectral subtraction , 1979 .

[80]  Tuomas Virtanen,et al.  Noise robust exemplar-based connected digit recognition , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[81]  J R Cohen,et al.  Application of an auditory model to speech recognition. , 1989, The Journal of the Acoustical Society of America.

[82]  Martin Graciarena,et al.  Damped oscillator cepstral coefficients for robust speech recognition , 2013, INTERSPEECH.

[83]  E. B. Newman,et al.  A Scale for the Measurement of the Psychological Magnitude Pitch , 1937 .

[84]  Xiao Li,et al.  Regularized Adaptation of Discriminative Classifiers , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[85]  Dimitra Vergyri,et al.  Medium-duration modulation cepstral feature for robust speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[86]  Jacob Benesty,et al.  Speech Enhancement , 2010 .

[87]  Elliot Saltzman,et al.  Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies , 2010, IEEE Journal of Selected Topics in Signal Processing.

[88]  Jonathan G. Fiscus,et al.  A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER) , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[89]  Stephen V. David,et al.  Representation of Phonemes in Primary Auditory Cortex: How the Brain Analyzes Speech , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[90]  Hermann Ney,et al.  Acoustic modeling with deep neural networks using raw time signal for LVCSR , 2014, INTERSPEECH.

[91]  R. Plomp,et al.  Effect of reducing slow temporal modulations on speech reception. , 1994, The Journal of the Acoustical Society of America.

[92]  J Tchorz,et al.  A model of auditory perception as front end for automatic speech recognition. , 1999, The Journal of the Acoustical Society of America.

[93]  Richard Rose,et al.  Architectures for deep neural network based acoustic models defined over windowed speech waveforms , 2015, INTERSPEECH.

[94]  K. Davis,et al.  Automatic Recognition of Spoken Digits , 1952 .

[95]  Hynek Hermansky,et al.  RASTA processing of speech , 1994, IEEE Trans. Speech Audio Process..

[96]  Dong Yu,et al.  Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[97]  Mark J. F. Gales,et al.  Investigation of unsupervised adaptation of DNN acoustic models with filter bank input , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[98]  Tran Huy Dat,et al.  Single and multi-channel approaches for distant speech recognition under noisy reverberant conditions: I2R'S system description for the ASpIRE challenge , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[99]  Mark J. F. Gales,et al.  The MGB challenge: Evaluating multi-genre broadcast media recognition , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[100]  A. Enis Çetin,et al.  Teager energy based feature parameters for speech recognition in car noise , 1999, IEEE Signal Processing Letters.

[101]  Daniel P. W. Ellis,et al.  LP-TRAP: linear predictive temporal patterns , 2004, INTERSPEECH.

[102]  Jon Barker,et al.  The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[103]  Wen Wang,et al.  Combating reverberation in large vocabulary continuous speech recognition , 2015, INTERSPEECH.

[104]  Richard M. Stern,et al.  Histogram-based subband powerwarping and spectral averaging for robust speech recognition under matched and multistyle training , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[105]  David Miller,et al.  The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text , 2004, LREC.

[106]  F. Itakura,et al.  Minimum prediction residual principle applied to speech recognition , 1975 .

[107]  Tomohiro Nakatani,et al.  The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech , 2013, 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

[108]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[109]  Richard F. Lyon,et al.  A computational model of filtering, detection, and compression in the cochlea , 1982, ICASSP.

[110]  DeLiang Wang,et al.  Transforming Binary Uncertainties for Robust Speech Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[111]  Wen Wang,et al.  Deep convolutional nets and robust features for reverberation-robust speech recognition , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[112]  Kaisheng Yao,et al.  KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[113]  George Saon,et al.  Digit recognition in noisy environments via a sequential GMM/SVM system , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[114]  B. Atal,et al.  Speech analysis and synthesis by linear prediction of the speech wave. , 1971, The Journal of the Acoustical Society of America.

[115]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.