Creating speaker independent ASR system through prosody modification based data augmentation

Abstract: This paper explores the effect of prosody-modification-based data augmentation in the context of automatic speech recognition (ASR). The primary motive is to develop ASR systems that are less affected by speaker-dependent acoustic variations. We focus on two factors that contribute to inter-speaker variability: pitch and speaking-rate variations. To study such a task, we trained an ASR system on adults’ speech and tested it on speech from both adult and child speakers. Compared with the adults’ speech test case, recognition performance degrades severely when the test speech comes from child speakers. The observed degradation is primarily due to large differences in pitch and speaking rate between adults’ and children’s speech. To overcome this problem, the pitch and speaking rate of the training speech are modified to create new versions of the data. The original and modified versions are then pooled together to capture greater acoustic variability. The ASR system trained on the augmented data is noted to be more robust to speaker-dependent variations. Relative improvements of 11.5% and 27.0% over the baseline are obtained when decoding the adults’ and children’s speech test sets, respectively.
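The augmentation idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's method: the abstract does not name a prosody-modification algorithm or the modification factors, so the naive overlap-add time stretch, the linear-interpolation resampler, the function names, and the factor values below are all assumptions chosen for brevity.

```python
import numpy as np

def time_stretch_ola(x, rate, frame=400, hop=100):
    """Naive overlap-add time-scale change (assumed stand-in technique).
    rate > 1 shortens the signal (faster speech); rate < 1 lengthens it."""
    window = np.hanning(frame)
    n_frames = max(1, int((len(x) - frame) / (hop * rate)) + 1)
    out = np.zeros((n_frames - 1) * hop + frame)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        start = int(i * hop * rate)           # read position in the input
        seg = x[start:start + frame]
        if len(seg) < frame:                  # only for inputs shorter than one frame
            seg = np.pad(seg, (0, frame - len(seg)))
        out[i * hop:i * hop + frame] += seg * window   # write at a fixed hop
        norm[i * hop:i * hop + frame] += window
    return out / np.maximum(norm, 1e-8)       # undo the window overlap gain

def resample_linear(x, factor):
    """Read the signal 'factor' times faster using linear interpolation.
    This scales pitch by 'factor' and duration by 1/factor."""
    n_out = int(round(len(x) / factor))
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(x)), x)

def pitch_shift(x, factor):
    """Scale pitch by 'factor' while roughly keeping the duration:
    lengthen by 'factor' first, then resample back to the original length."""
    return resample_linear(time_stretch_ola(x, 1.0 / factor), factor)

def augment(x, pitch_factors=(1.25,), rate_factors=(0.8, 1.25)):
    """Pool the original utterance with its modified versions.
    The default factors are hypothetical, not values from the paper."""
    pooled = [x]
    pooled += [pitch_shift(x, f) for f in pitch_factors]
    pooled += [time_stretch_ola(x, r) for r in rate_factors]
    return pooled

# Usage: one second of a synthetic 200 Hz tone at 16 kHz stands in for speech.
sr = 16000
tone = np.sin(2 * np.pi * 200 * np.arange(sr) / sr)
versions = augment(tone)
```

Each element of `versions` would then be passed through the same feature extraction as the original utterance and added to the training set. A plain overlap-add stretch introduces audible artifacts on real speech, so a pitch-synchronous or epoch-based method would be a more realistic choice in practice.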
