Addressing noise and pitch sensitivity of speech recognition system through variational mode decomposition based spectral smoothing

Abstract In this paper, we propose a novel front-end speech parameterization technique for automatic speech recognition (ASR) that is less sensitive to ambient noise and pitch variations. First, using variational mode decomposition (VMD), we decompose the short-time magnitude spectrum obtained by the discrete Fourier transform into several components. The spectrum is then smoothed to suppress the ill effects of noise and pitch variations. This smoothing is achieved by discarding the higher-order variational mode functions and reconstructing the spectrum from the first two modes only. As a result, the smoothed spectrum closely resembles the spectral envelope. Next, Mel-frequency cepstral coefficients (MFCC) are extracted from the VMD-based smoothed spectra. The experimental evaluations presented in this study show that the proposed front-end acoustic features are more robust to ambient noise and pitch variations than conventional MFCC features. For these evaluations, we developed an ASR system using speech data from adult speakers collected under relatively clean recording conditions. State-of-the-art acoustic modeling techniques based on deep neural networks (DNN) and long short-term memory recurrent neural networks (LSTM-RNN) were employed. The ASR systems were then evaluated under noisy test conditions to assess the noise robustness of the proposed features. To assess robustness to pitch variations, experimental evaluations were performed on another test set consisting of speech data from child speakers. Transcribing children's speech simulates an ASR task in which the pitch differences between training and test data are large. The signal-domain analyses as well as the experimental evaluations presented in this paper support our claims.
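The processing chain described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the VMD routine is a simplified NumPy version of the algorithm of Dragomiretskiy and Zosso, and the number of modes (5), the penalty alpha (2000), the synthetic test spectrum, the 16 kHz sampling rate, the 23 mel filters, the 13 cepstral coefficients, and all function names are hypothetical choices made for the example. The abstract itself states only that the first two modes are retained.

```python
import numpy as np


def vmd(signal, num_modes=5, alpha=2000.0, tau=0.0, tol=1e-7, max_iter=500):
    """Simplified VMD of a 1-D real sequence (illustrative, not the paper's code).

    Returns (modes, center_freqs), sorted by ascending center frequency.
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    half = n // 2
    # Mirror-extend both ends to reduce boundary artifacts.
    ext = np.concatenate([x[:half][::-1], x, x[n - half:][::-1]])
    f_hat = np.fft.rfft(ext)             # one-sided transform of the sequence
    freqs = np.fft.rfftfreq(ext.size)    # normalized frequency, 0 to 0.5
    u_hat = np.zeros((num_modes, f_hat.size), dtype=complex)
    omega = 0.5 * np.arange(num_modes) / num_modes  # uniform initial centers
    lam = np.zeros_like(f_hat)           # Lagrange multiplier (stays 0 if tau=0)
    for _ in range(max_iter):
        previous = u_hat.copy()
        for k in range(num_modes):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Wiener-filter update of mode k around its center frequency.
            u_hat[k] = (f_hat - others + lam / 2) / (1 + alpha * (freqs - omega[k]) ** 2)
            power = np.abs(u_hat[k]) ** 2
            # New center frequency: power-weighted mean frequency of the mode.
            omega[k] = (freqs @ power) / (power.sum() + 1e-12)
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))
        change = np.sum(np.abs(u_hat - previous) ** 2)
        if change / (np.sum(np.abs(previous) ** 2) + 1e-12) < tol:
            break
    # Back to the original domain, then crop away the mirrored extension.
    modes = np.fft.irfft(u_hat, n=ext.size, axis=1)[:, half:half + n]
    order = np.argsort(omega)
    return modes[order], omega[order]


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular mel filters over bins assumed to span 0 Hz to Nyquist."""
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_filters + 2))
    bin_hz = np.linspace(0.0, sample_rate / 2.0, num_bins)
    bank = np.zeros((num_filters, num_bins))
    for m in range(num_filters):
        lo, mid, hi = edges_hz[m:m + 3]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mfcc_from_magnitude(magnitude, sample_rate=16000, num_filters=23, num_ceps=13):
    """Log mel energies followed by an unnormalized DCT-II."""
    bank = mel_filterbank(num_filters, magnitude.size, sample_rate)
    # Modes are not constrained to be positive, so floor the spectrum at zero.
    log_energy = np.log(bank @ (np.maximum(magnitude, 0.0) ** 2) + 1e-10)
    m = np.arange(num_filters)
    basis = np.cos(np.pi * np.outer(np.arange(num_ceps), m + 0.5) / num_filters)
    return basis @ log_energy


def roughness(values):
    """Sum of squared first differences; smaller means smoother."""
    return float(np.sum(np.diff(values) ** 2))


# Synthetic magnitude spectrum (assumed data): a slowly varying envelope
# multiplied by a pitch-like harmonic ripple, plus a small noise floor.
rng = np.random.default_rng(0)
bins = np.arange(256)
envelope = 0.3 + np.exp(-((bins - 40) / 18.0) ** 2) + 0.6 * np.exp(-((bins - 110) / 25.0) ** 2)
harmonics = 1.0 + 0.8 * np.cos(2 * np.pi * bins / 6.0)
spectrum = envelope * harmonics + 0.05 * rng.random(256)

modes, centers = vmd(spectrum, num_modes=5)
smoothed = modes[:2].sum(axis=0)   # keep only the first two modes
mfcc = mfcc_from_magnitude(smoothed)
```

Comparing `roughness(smoothed)` with `roughness(spectrum)` gives a quick check that the two-mode reconstruction is smoother than the input. How closely it follows the spectral envelope depends on the number of modes and on alpha, neither of which is given in the abstract, so both would need tuning on real speech frames.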
