Characterization of speakers for improved automatic speech recognition

Automatic speech recognition technology is becoming increasingly widespread in many applications. For dictation tasks, where a single talker uses the system for long periods of time, the high recognition accuracies obtained are due in part to the user performing a lengthy enrolment procedure to ‘tune’ the parameters of the recogniser to their particular voice characteristics and speaking style. Interactive speech systems, where the speaker uses the system for only a short period of time (for example, to obtain information), do not have the luxury of long enrolments and have to adapt rapidly to new speakers and speaking styles. This thesis discusses the variations between speakers and speaking styles that result in decreased recognition performance when there is a mismatch between the talker and the system's models. An unsupervised method to rapidly identify and normalise differences in vocal tract length is presented and shown to give improvements in recognition accuracy for little computational overhead. Two unsupervised methods of identifying speakers with similar speaking styles are also presented. The first, a data-driven technique, is shown to accurately classify British- and American-accented speech, and is also used to improve recognition accuracy by clustering groups of similar talkers. The second uses the phonotactic information available within pronunciation dictionaries to model British- and American-accented speech. This model is then used to rapidly and accurately classify speakers.
