Speech recognition from spectral dynamics

Information is carried in changes of a signal. The paper begins by revisiting Dudley’s concept of the carrier nature of speech. It points to the close connection of this concept to the modulation spectrum of speech and argues against short-term spectral envelopes as the dominant carriers of linguistic information in speech. The history of spectral representations of speech is briefly discussed. A partial history of the gradual infusion of the modulation spectrum concept into automatic speech recognition (ASR) follows, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features and RelAtive SpecTrAl (RASTA) filtering. Next, the frequency domain perceptual linear prediction technique for deriving autoregressive models of temporal trajectories of spectral power in individual frequency bands is reviewed. Finally, posterior-based features, which allow for straightforward application of modulation frequency domain information, are described. The paper is tutorial in nature, aims at a global historical overview of attempts to use spectral dynamics in machine recognition of speech, and does not always describe the techniques in full detail. Extensive references to earlier work are provided to compensate.
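
To make the link between dynamic speech features and modulation spectrum processing concrete, the following minimal sketch computes delta (dynamic) features for the temporal trajectory of one frequency band. It assumes the common regression formulation of delta features, which is not necessarily the formulation used in the paper; the function name `delta_trajectory` and the parameter `half_window` are hypothetical names chosen for illustration. Because each output value is a fixed weighted sum of neighboring frames, the operation is a finite impulse response filter applied along time, with zero gain for a constant trajectory, so it suppresses the slowest modulations and emphasizes change.

```python
def delta_trajectory(trajectory, half_window=2):
    """Delta features for one band's temporal trajectory (list of floats).

    Assumed regression formula (a common convention, not taken from the paper):
        d[t] = sum_{k=1..N} k * (c[t+k] - c[t-k]) / (2 * sum_{k=1..N} k^2)
    Frames beyond either end are replaced by the nearest edge frame.
    """
    num_frames = len(trajectory)
    denom = 2.0 * sum(k * k for k in range(1, half_window + 1))

    def at(t):
        # Clamp the index so that edge frames are repeated at the boundaries.
        return trajectory[min(max(t, 0), num_frames - 1)]

    deltas = []
    for t in range(num_frames):
        acc = 0.0
        for k in range(1, half_window + 1):
            acc += k * (at(t + k) - at(t - k))
        deltas.append(acc / denom)
    return deltas


# Usage: a constant trajectory carries no change, so its deltas vanish,
# while a linear ramp yields its slope away from the edges.
flat = delta_trajectory([5.0] * 8)
ramp = delta_trajectory([3.0 * t for t in range(10)])
```

In this sketch the antisymmetric weights act as a band-pass-like filter in the modulation frequency domain, which is the sense in which dynamic features can be read as a simple form of modulation spectrum processing.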
