Speech Analyser in an ICAI System for TESOL

There is an increasing demand for computer software that can give English as a Second Language (ESL) speakers useful, personalised feedback on prosodic aspects of their speech, both to compensate for the shortage of ESL teachers and to reduce the cost of learning. This thesis concentrates on constructing a prototype of such an Intelligent Computer Aided Instruction (ICAI) system, focusing on one component: the Speech Analyser. The speech analyser recognises a user’s speech, identifies its rhythmic stress pattern, detects stress and rhythm errors, and reports them to another component, which generates personalised feedback on how the user can effectively improve the prosodic aspects of their speech. We build a Hidden Markov Model (HMM) based speech recogniser to recognise a user’s speech. A set of parameters for constructing the recogniser is investigated through an exhaustive experiment run on a client/server computing network. The results suggest that the choice of parameters is very important. We build stress detectors that detect the rhythmic stress pattern in the user’s speech using both Support Vector Machine (SVM) and Decision Tree (DT) techniques. The SVM-based detector outperforms the DT-based one, which suggests that SVMs are better suited than DTs to a relatively large data set consisting entirely of numeric data. We build an error identifier to automatically identify stress and rhythm errors in the user’s speech. A two-layer phoneme alignment algorithm based on the Needleman/Wunsch technique is developed to support prosodic error identification. Our study also suggests that the foot comparison method is better than the Vowel Onset Point comparison method for automatically identifying the main rhythm errors in the user’s speech.
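
To make the alignment step concrete, the following is a minimal sketch of plain Needleman/Wunsch global alignment applied to two phoneme sequences. It is not the two-layer algorithm developed in the thesis, whose details are not given here. The function name `needleman_wunsch`, the scoring values (match +1, mismatch -1, gap -1), and the example phoneme labels are all assumptions chosen for illustration.

```python
def needleman_wunsch(ref, hyp, match=1, mismatch=-1, gap=-1):
    """Globally align two phoneme sequences.

    Returns (pairs, score), where pairs is a list of (ref_phoneme, hyp_phoneme)
    tuples and None marks a gap. Scoring values are illustrative assumptions.
    """
    n, m = len(ref), len(hyp)

    def sub(i, j):
        # Substitution score for ref[i-1] against hyp[j-1].
        return match if ref[i - 1] == hyp[j - 1] else mismatch

    # score[i][j] = best score for aligning ref[:i] with hyp[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or substitution
                score[i - 1][j] + gap,            # phoneme missing from hyp
                score[i][j - 1] + gap,            # extra phoneme in hyp
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs, score[n][m]


# Hypothetical example: the learner drops the vowel of the expected word.
expected = ["k", "ae", "t"]
spoken = ["k", "t"]
alignment, total = needleman_wunsch(expected, spoken)
```

In this example the alignment pairs `k` with `k`, leaves `ae` unmatched, and pairs `t` with `t`, so the missing vowel is isolated rather than shifting every later phoneme out of step. An aligned sequence of this kind is what allows expected and spoken units to be compared position by position.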
