Pitch-robust acoustic feature using single frequency filtering for children's KWS

Abstract Pitch and speaking rate are two significant factors that cause acoustic mismatch in children’s keyword spotting (KWS) systems. This paper proposes a pitch-robust acoustic feature based on single frequency filtering (SFF) for developing a children’s KWS system. In the proposed approach, the amplitude envelopes (AEs) of the speech signal are computed with SFF at D selected frequencies spaced on the Mel scale. The AEs are then averaged over short-time overlapping analysis frames and logarithmically compressed to give a D-dimensional feature vector per analysis frame, termed the Mel spaced single frequency average log envelope (MSSF-ALE). With the proposed MSSF-ALE feature, a deep neural network-hidden Markov model-based KWS system performs better than with standard Mel-frequency cepstral coefficients (MFCC) or with MFCC extracted from smoothed spectra. MSSF-ALE yields a relative improvement of 104.44% in term-weighted value (TWV) over MFCC for children’s KWS. The KWS system is then evaluated with data-augmented training, in which the speaking rate of the training data is explicitly modified. With data-augmented training, MSSF-ALE provides a relative improvement of 195.94% in TWV over MFCC. MSSF-ALE also outperforms the other explored features in noisy test conditions.
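
The feature pipeline described in the abstract (envelopes at D Mel-spaced frequencies, frame-wise averaging, log compression) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The abstract does not specify the filter itself, so the sketch assumes the commonly described SFF formulation: the differenced signal is frequency-shifted so that the frequency of interest lands at half the sampling rate, then passed through a single-pole filter with its pole at z = -r. The pole radius `r = 0.995`, the 20 ms frame with 10 ms shift, the 100 Hz lower edge, the default `num_freqs`, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def hz_to_mel(f_hz):
    """Convert Hz to Mel (common 2595*log10 form; an assumed choice)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_spaced_frequencies(num_freqs, f_low, f_high):
    """Return num_freqs frequencies (Hz) equally spaced on the Mel scale."""
    mels = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), num_freqs)
    return mel_to_hz(mels)


def sff_envelopes(signal, fs, freqs_hz, r=0.995):
    """Amplitude envelopes at each frequency, shape (D, num_samples).

    Assumed SFF form: difference the signal, shift frequency f_k to fs/2,
    then apply y[n] = x_k[n] - r * y[n-1] (single pole at z = -r).
    """
    s = np.asarray(signal, dtype=float)
    x = np.diff(s, prepend=s[0])  # first difference; x[0] is 0
    n = np.arange(x.size)
    omega = np.pi - 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) / fs
    shifted = x[None, :] * np.exp(1j * omega[:, None] * n[None, :])
    out = np.empty_like(shifted)
    state = np.zeros(omega.size, dtype=complex)
    for i in range(x.size):  # recursive filter, vectorized over frequencies
        state = shifted[:, i] - r * state
        out[:, i] = state
    return np.abs(out)


def mssf_ale(signal, fs, num_freqs=40, frame_len_s=0.020, frame_shift_s=0.010,
             f_low=100.0, f_high=None, r=0.995, eps=1e-10):
    """Average log envelope features, shape (num_frames, num_freqs)."""
    if f_high is None:
        f_high = fs / 2.0
    freqs = mel_spaced_frequencies(num_freqs, f_low, f_high)
    env = sff_envelopes(signal, fs, freqs, r=r)
    frame_len = int(round(frame_len_s * fs))
    frame_shift = int(round(frame_shift_s * fs))
    starts = range(0, env.shape[1] - frame_len + 1, frame_shift)
    # Average each envelope over the frame, then compress logarithmically.
    frames = [env[:, s:s + frame_len].mean(axis=1) for s in starts]
    return np.log(np.array(frames) + eps)


if __name__ == "__main__":
    fs = 8000
    t = np.arange(fs) / fs
    feats = mssf_ale(np.sin(2.0 * np.pi * 1000.0 * t), fs, num_freqs=20)
    print(feats.shape)  # (num_frames, num_freqs)
```

Because each envelope comes from a narrow resonance at one frequency rather than from a short-time Fourier transform of a windowed block, the averaging step is the only place where the frame length enters. The per-sample Python loop is kept for readability; a compiled recursive filter would be preferable for real data.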
