Robust auditory-based speech processing using the average localized synchrony detection

A new auditory-based speech processing system is proposed, built on the biologically rooted property of average localized synchrony detection (ALSD). The system detects periodicity in the speech signal at Bark-scaled frequencies while suppressing spurious peaks in the response and reducing sensitivity to implementation mismatches, and hence yields a consistent and robust representation of the formants. The system is evaluated on its ability to extract formants while suppressing spurious peaks, and is compared with other auditory-based and traditional front ends on vowel and consonant recognition, both on clean speech from the TIMIT database and in the presence of noise. The results illustrate the advantage of the ALSD system in extracting the formants and reducing spurious peaks. They also indicate the superiority of synchrony measures over the mean-rate measure in the presence of noise.
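The core idea of a localized synchrony measure can be illustrated with a minimal sketch: a channel centered at frequency CF responds strongly only when the input is periodic at (or near) CF, which is what makes the representation peak at formant frequencies. The sketch below, in the spirit of Seneff-style generalized synchrony detection rather than the paper's exact ALSD implementation, compares the signal with copies delayed by multiples of the candidate period 1/CF and averages the resulting ratios; the function name, the Traunmüller Bark-conversion formula, and all parameter choices here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bark_to_hz(bark):
    # Inverse of Traunmueller's Bark approximation (one of several
    # published variants; chosen here only for illustration).
    return 1960.0 * (bark + 0.53) / (26.28 - bark)

def synchrony_at_cf(x, fs, cf, n_delays=4):
    """Simplified localized synchrony measure at center frequency cf (Hz).

    For each delay k/cf (k = 1..n_delays), compare the signal with its
    delayed copy: a component periodic at cf makes |x(t) + x(t - k/cf)|
    large and |x(t) - x(t - k/cf)| small, so the ratio is large.
    Averaging over several delays is a crude stand-in for the
    'average localized' aspect of ALSD.
    """
    period = int(round(fs / cf))          # candidate period in samples
    ratios = []
    for k in range(1, n_delays + 1):
        d = k * period
        if d >= len(x):
            break
        a, b = x[d:], x[:-d]              # aligned and delayed segments
        num = np.mean(np.abs(a + b))
        den = np.mean(np.abs(a - b)) + 1e-12   # avoid division by zero
        ratios.append(num / den)
    return float(np.mean(ratios))

fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)
tone = np.sin(2 * np.pi * 500 * t)        # pure tone, periodic at 500 Hz

s_on = synchrony_at_cf(tone, fs, 500)     # large: synchronous at 500 Hz
s_off = synchrony_at_cf(tone, fs, 707)    # small: off-frequency channel
```

Evaluating such a measure at a bank of Bark-scaled center frequencies produces a synchrony spectrum whose peaks track the formants; the suppression of spurious peaks that distinguishes ALSD would require the paper's specific averaging scheme, which this sketch does not reproduce.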
