Acoustic landmark detection and segmentation using the McAulay-Quatieri Sinusoidal Model

Phonetic landmark detection in the Spoken Language Systems Group at MIT is currently performed by summit, a segment-based speech recognition system. Under noisy conditions the system's segmentation algorithm has difficulty distinguishing noise from speech components and often produces poor alignments of sounds. Noise robustness in summit can be improved with a full segmentation method, which allows landmarks at regularly spaced intervals; while this approach is computationally more expensive than the original segmentation method, it is more robust in noisy environments. In this thesis, we explore a landmark detection and segmentation algorithm based on the McAulay-Quatieri sinusoidal model, in the hope of improving the recognizer's performance in noisy conditions. We first discuss the sinusoidal model representation, in which rapid changes in spectral components are tracked using the concept of "birth" and "death" of underlying sinewaves. Next, we describe our method of landmark detection, which is driven by the behavior of the sinewave tracks generated by this model. These landmarks are connected to form a graph of hypothetical segments. Finally, we experiment with different segmentation algorithms to reduce the size of the segment graph. We compare the performance of our approach with the full and original segmentation methods in different noise environments. The word error rate of the original segmentation method degrades rapidly in the presence of noise, while the sinusoidal and full segmentation methods degrade more gracefully; the full segmentation method, however, requires the most computation of the three. We find that our algorithm provides the best tradeoff between word accuracy and computation time. Furthermore, our model remains robust when speech is contaminated by white noise, speech babble noise, and destroyer operations room noise.

Thesis Supervisor: Timothy J. Hazen
Title: Research Scientist, CSAIL
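The core mechanism the abstract refers to, the "birth" and "death" of sinewave tracks in the McAulay-Quatieri model, can be illustrated with a minimal sketch: pick spectral peaks in each analysis frame, then match peaks frame-to-frame by nearest frequency, declaring a "birth" for any current peak with no close predecessor and a "death" for any previous peak with no close successor. This is not the thesis implementation; the function names, peak-picking threshold, and matching tolerance below are illustrative assumptions.

```python
import numpy as np

def frame_peaks(frame, sample_rate, n_fft=512, threshold_db=-30.0):
    """Return frequencies (Hz) of prominent local maxima in the
    magnitude spectrum of one Hann-windowed analysis frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    mag_db = 20.0 * np.log10(spec + 1e-12)
    floor = mag_db.max() + threshold_db  # keep peaks within 30 dB of the max
    peaks = [k for k in range(1, len(spec) - 1)
             if mag_db[k] > mag_db[k - 1]
             and mag_db[k] >= mag_db[k + 1]
             and mag_db[k] > floor]
    return np.asarray(peaks, dtype=float) * sample_rate / n_fft

def match_tracks(prev_peaks, cur_peaks, delta_hz=50.0):
    """Nearest-frequency matching between consecutive frames.
    A current peak with no previous peak within delta_hz is a 'birth';
    a previous peak with no current peak within delta_hz is a 'death'."""
    births = [f for f in cur_peaks
              if prev_peaks.size == 0
              or np.min(np.abs(prev_peaks - f)) > delta_hz]
    deaths = [f for f in prev_peaks
              if cur_peaks.size == 0
              or np.min(np.abs(cur_peaks - f)) > delta_hz]
    return births, deaths
```

In this sketch, frames where many births or deaths occur at once would be natural candidates for acoustic landmarks, since abrupt track onsets and terminations tend to coincide with phonetic boundaries.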
