Paper information - Automatic Speech Recognition for ageing voices

Automatic Speech Recognition for ageing voices

With ageing, human voices undergo several changes, typically characterised by increased hoarseness, breathiness, changes in articulatory patterns and a slower speaking rate. This thesis aims to understand the impact of ageing on Automatic Speech Recognition (ASR) performance and to improve ASR accuracy for older voices. Baseline results on three corpora indicate that word error rates (WER) for older adults are significantly higher than those for younger adults, and that the decrease in accuracy is larger for male speakers than for female speakers. Acoustic parameters such as jitter and shimmer, which measure glottal source disfluencies, were found to be significantly higher for older adults. However, the hypothesis that these changes explain the difference in WER between the two age groups is proven incorrect: experiments that artificially introduce glottal source disfluencies into speech from younger adults show no significant impact on WER. Changes in fundamental frequency, which are observed quite often in older voices, have a marginal impact on ASR accuracy. Analysis of phoneme errors for younger and older speakers shows that certain phonemes, especially lower vowels, are more affected by ageing. These changes, however, are seen to vary across speakers. Another factor strongly associated with ageing voices is a decrease in the rate of speech. Experiments analysing the impact of a slower speaking rate on ASR accuracy indicate that insertion errors increase when slower speech is decoded with models trained on relatively faster speech. We then propose a way to characterise speakers in acoustic space based on speaker adaptation transforms, and observe that speakers (especially males) can be segregated by age with reasonable accuracy. Inspired by this, we look at supervised hierarchical acoustic models based on gender and age. Such models achieve significant improvements in word accuracy over the baseline results. The idea is then extended to construct unsupervised hierarchical models, which also outperform the baseline models by a good margin. Finally, we hypothesise that ASR accuracy can be improved by augmenting the adaptation data with speech from the acoustically closest speakers, and propose a strategy for selecting the augmentation speakers. Experimental results on two corpora indicate that the hypothesis holds true only when the amount of available adaptation data is limited to a few seconds. The efficacy of such a speaker selection strategy is analysed for both younger and older adults.
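The abstract reports results in terms of word error rate and, for slower speech, insertion errors in particular. As a minimal sketch of how these quantities are conventionally computed (not the scoring tool used in the thesis, which the abstract does not name), the following aligns a reference transcript with a recogniser hypothesis by word-level edit distance and counts substitutions, deletions and insertions. The function name `wer_counts` and the example sentences are illustrative assumptions.

```python
def wer_counts(reference: str, hypothesis: str) -> dict:
    """Word error rate with substitution/deletion/insertion counts.

    Illustrative sketch only: whitespace tokenisation, unit edit costs.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # d[i][j] = (total edits, subs, dels, ins) for ref[:i] versus hyp[:j]
    d = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        d[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = (j, 0, 0, j)          # only insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, dl, ins = d[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, dl, ins)              # correct word
            else:
                diag = (t + 1, s + 1, dl, ins)      # substitution
            t, s, dl, ins = d[i - 1][j]
            deletion = (t + 1, s, dl + 1, ins)
            t, s, dl, ins = d[i][j - 1]
            insertion = (t + 1, s, dl, ins + 1)
            # Keep the alignment with the fewest total edits.
            d[i][j] = min(diag, deletion, insertion, key=lambda c: c[0])

    total, subs, dels, ins = d[-1][-1]
    return {"wer": total / len(ref), "sub": subs, "del": dels, "ins": ins}


if __name__ == "__main__":
    # A repeated word in the hypothesis counts as one insertion error.
    print(wer_counts("the cat sat", "the the cat sat"))
```

Because WER is normalised by the reference length, a hypothesis with many insertions can push it above 1.0, which is one reason insertion-heavy decoding of slow speech is costly under this metric.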

Ravichander Vipperla
