Automatic Speech Recognition and Intrinsic Speech Variation

This paper briefly reviews the state of the art on sources of speech variability in automatic speech recognition (ASR) systems. It focuses on variations within the speech signal that make the ASR task difficult. The variations detailed in the paper are intrinsic to speech and affect the different levels of the ASR processing chain. For different sources of speech variation, the paper summarizes current knowledge and highlights specific feature extraction or modeling weaknesses and current trends.
