Spectral Smoothing by Variationalmode Decomposition and its Effect on Noise and Pitch Robustness of ASR System

A novel front-end speech parameterization technique that is robust towards ambient noise and pitch variations is proposed in this paper. In the proposed technique, the short-time magnitude spectrum obtained by discrete Fourier transform is first decomposed in several components using variational mode decomposition (VMD). For sufficiently smoothing the spectrum, the higher-order components are discarded. The smoothed spectrum is then obtained by reconstructing the spectrum using the first-two modes only. The Mel-frequency cepstral coefficients computed using the VMD-based smoothed spectra are observed to be affected less by ambient noise and pitch variations. To validate the same, an automatic speech recognition system is developed on clean speech from adult speakers and evaluated under noisy test conditions. Furthermore, experimental evaluations are also performed on another test set which consists of speech data from children to simulate large pitch differences. The experimental evaluations as well as signal domain analyses presented in this paper support these claims.

[1]  Li Lee,et al.  A frequency warping approach to speaker normalization , 1998, IEEE Trans. Speech Audio Process..

[2]  Jonas Beskow,et al.  Wavesurfer - an open source speech tool , 2000, INTERSPEECH.

[3]  Shweta Ghai,et al.  On the use of pitch normalization for improving children's speech recognition , 2009, INTERSPEECH.

[4]  Diego Giuliani,et al.  Deep-neural network approaches for speech recognition with heterogeneous groups of speakers including children† , 2016, Natural Language Engineering.

[5]  Shweta Ghai,et al.  Exploring the role of spectral smoothing in context of children's speech recognition , 2009, INTERSPEECH.

[6]  M. Picheny,et al.  Comparison of Parametric Representation for Monosyllabic Word Recognition in Continuously Spoken Sentences , 2017 .

[7]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[8]  I. Hirsh,et al.  Development of speech sounds in children. , 1969, Acta oto-laryngologica. Supplementum.

[9]  Tara N. Sainath,et al.  Large vocabulary automatic speech recognition for children , 2015, INTERSPEECH.

[10]  D. Govind,et al.  Accurate Estimation of Glottal Closure Instants and Glottal Opening Instants from Electroglottographic Signal Using Variational Mode Decomposition , 2018, Circuits Syst. Signal Process..

[11]  Yao Yao,et al.  Application of the Variational-Mode Decomposition for Seismic Time–frequency Analysis , 2016, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.

[12]  S. Shahnawazuddin,et al.  Enhancing the recognition of children's speech on acoustically mismatched ASR system , 2015, TENCON 2015 - 2015 IEEE Region 10 Conference.

[13]  S. R. Samantaray,et al.  Variational Mode Decomposition and Decision Tree Based Detection and Classification of Power Quality Disturbances in Grid-Connected Distributed Generation System , 2018, IEEE Transactions on Smart Grid.

[14]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[15]  Daniel Elenius,et al.  The PF_STAR children's speech corpus , 2005, INTERSPEECH.

[16]  Ahmet Mert,et al.  ECG feature extraction based on the bandwidth properties of variational mode decomposition , 2016, Physiological measurement.

[17]  Shweta Ghai,et al.  A Study on the Effect of Pitch on LPCC and PLPC Features for Children's ASR in Comparison to MFCC , 2011, INTERSPEECH.

[18]  Shrikanth S. Narayanan,et al.  Acoustics of children's speech: developmental changes of temporal and spectral parameters. , 1999, The Journal of the Acoustical Society of America.

[19]  Shrikanth S. Narayanan,et al.  A review of ASR technologies for children's speech , 2009, WOCCI.

[20]  Francoise Beaufays,et al.  “Your Word is my Command”: Google Search by Voice: A Case Study , 2010 .

[21]  Raymond D. Kent,et al.  Anatomical and neuromuscular maturation of the speech mechanism: evidence from acoustic studies. , 1976, Journal of speech and hearing research.

[22]  Vassilios Digalakis,et al.  Speaker adaptation using constrained estimation of Gaussian mixtures , 1995, IEEE Trans. Speech Audio Process..

[23]  Martin J. Russell,et al.  Challenges for computer recognition of children2s speech , 2007, SLaTE.

[24]  Diego Giuliani,et al.  Vocal tract length normalisation approaches to DNN-based children's and adults' speech recognition , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[25]  J. Foote,et al.  WSJCAM0: A BRITISH ENGLISH SPEECH CORPUS FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION , 1995 .

[26]  Dominique Zosso,et al.  Variational Mode Decomposition , 2014, IEEE Transactions on Signal Processing.

[27]  K. P. Soman,et al.  Recursive Variational Mode Decomposition Algorithm for Real Time Power Signal Decomposition , 2015 .

[28]  S. Shahnawazuddin,et al.  Exploring HLDA based transformation for reducing acoustic mismatch in context of children speech recognition , 2014, 2014 International Conference on Signal Processing and Communications (SPCOM).

[29]  Yifan Gong,et al.  An Overview of Noise-Robust Automatic Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[30]  Jan Cernocký,et al.  Improved feature processing for deep neural networks , 2013, INTERSPEECH.