Phonetic alignment: speech synthesis-based vs. Viterbi-based

In this paper we compare two different methods for automatically phonetically labeling a continuous speech database, as is usually required for designing a speech recognition or speech synthesis system. The first method is based on temporal alignment of the speech signal with a synthetic speech pattern; the second uses either a continuous-density hidden Markov model (HMM) system or a hybrid HMM/ANN (artificial neural network) system in forced alignment mode. Both systems have been evaluated on read utterances not included in the training set of the HMM systems, and compared to manual segmentation. This study outlines the advantages and drawbacks of both methods. The synthesis-based system has the great advantage that no training stage (hence no large labeled database) is needed, while HMM systems easily handle multiple phonetic transcriptions (phonetic lattices). We deduce a method for the automatic creation of large phonetically labeled speech databases, based on using the synthetic speech segmentation tool to bootstrap the training process of either an HMM or a hybrid HMM/ANN system. Such segmentation tools are a key point for the development of improved multilingual speech synthesis and recognition systems.
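The synthesis-based method rests on a simple idea: the synthetic utterance is generated from the phonetic transcription, so its phone boundaries are known exactly; aligning the natural speech frames to the synthetic frames (e.g. by dynamic time warping) then transfers those boundaries onto the natural signal. A minimal sketch of this boundary transfer, assuming hypothetical one-dimensional frame features and a user-supplied frame distance (a real system would use spectral feature vectors such as MFCCs):

```python
# Illustrative sketch: transfer known phone boundaries from a synthetic
# utterance onto natural speech via DTW alignment. Feature values and the
# distance function are placeholders, not the paper's actual front-end.

def dtw_path(ref, test, dist):
    """Dynamic time warping of `test` frames onto `ref` frames.

    Returns the optimal alignment path as (ref_index, test_index) pairs.
    """
    n, m = len(ref), len(test)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    # Backtrack from the end of both sequences along minimal-cost cells.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j],     i - 1, j),
                      (cost[i][j - 1],     i,     j - 1))
    return path[::-1]

def transfer_boundaries(path, synth_boundaries):
    """Map phone-start frame indices of the synthetic signal to the
    first natural frame aligned with each of them."""
    first_match = {}
    for si, ni in path:
        first_match.setdefault(si, ni)
    return [first_match[b] for b in synth_boundaries]

# Toy example: synthetic frames [1, 2, 3] (one frame per phone, boundaries
# at frames 1 and 2); natural speech holds each phone for two frames.
synth = [1, 2, 3]
natural = [1, 1, 2, 2, 3, 3]
path = dtw_path(synth, natural, lambda a, b: abs(a - b))
print(transfer_boundaries(path, [1, 2]))  # -> [2, 4]
```

In the toy example the second and third phones indeed start at natural frames 2 and 4, so the transferred boundaries land correctly. The HMM alternative replaces this template alignment with Viterbi forced alignment against trained phone models, which is why it needs a (bootstrapped) labeled training set.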
