Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition

This paper presents a new feature extraction algorithm called power-normalized cepstral coefficients (PNCC) that is motivated by auditory processing. Major new features of PNCC processing include a power-law nonlinearity that replaces the traditional log nonlinearity used in MFCC processing, a noise-suppression algorithm based on asymmetric filtering that suppresses background excitation, and a module that accomplishes temporal masking. We also propose the use of medium-time power analysis, in which environmental parameters are estimated over a longer duration than is commonly used for speech, as well as frequency smoothing. Experimental results demonstrate that PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for speech in the presence of various types of additive noise and in reverberant environments. It does so at only slightly greater computational cost than conventional MFCC processing, and without degrading the recognition accuracy observed when training and testing on clean speech. PNCC processing also provides better recognition accuracy in noisy environments than techniques such as vector Taylor series (VTS) and the ETSI advanced front end (AFE), while requiring much less computation. We describe an implementation of PNCC using “online processing” that does not require future knowledge of the input.
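To illustrate why replacing the log nonlinearity with a power law can matter for noise robustness, the following minimal sketch compares the two compressions on channel powers of different magnitudes. The exponent of 1/15, the flooring constant, and the function names are assumptions made for illustration; the abstract does not state them.

```python
import math


def log_compress(power, floor=1e-30):
    """Log nonlinearity as used in MFCC-style front ends.

    The floor is a hypothetical guard against log(0); it is not from the abstract.
    """
    return math.log(max(power, floor))


def power_law_compress(power, exponent=1.0 / 15.0):
    """Power-law nonlinearity; the exponent value is an assumed example."""
    return power ** exponent


if __name__ == "__main__":
    # Compare how each nonlinearity treats very small channel powers.
    for p in (1e-12, 1e-6, 1.0):
        print(f"power={p:g}  log={log_compress(p):.3f}  "
              f"power-law={power_law_compress(p):.3f}")
```

Under these assumptions, the log output becomes increasingly negative as the power approaches zero, so small fluctuations in low-power regions produce large changes in the output, whereas the power-law output stays bounded near zero.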
