Automatic Phone Alignment - A Comparison between Speaker-Independent Models and Models Trained on the Corpus to Align

Several automatic phonetic alignment tools have been proposed in the literature. They generally rely on speaker-independent acoustic models of the target language to align new corpora. The problem is that the range of available models is limited: they do not cover all languages and speaking styles (spontaneous, expressive, etc.). This study investigates the possibility of training the statistical models directly on the corpus to be aligned. The main advantage is that the approach applies to any language and speaking style. Moreover, comparisons indicate that it yields results as good as or better than speaker-independent models of the language: about 2 percentage points of alignment accuracy are gained at a 20 ms tolerance threshold with our method. Experiments were carried out on neutral and expressive corpora in French and English. The study also shows that even a small neutral corpus of a few minutes suffices to train a model that provides high-quality alignment.
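The core operation shared by all the aligners discussed above is forced alignment: given acoustic models and the known phone sequence of an utterance, find the phone boundaries that maximise the likelihood of the audio. As a minimal sketch (not the authors' implementation), the following toy example shows the Viterbi step of forced alignment, assuming per-frame log-likelihoods have already been computed by phone models; in the strategy the abstract advocates, those models would themselves be trained on the corpus to be aligned.

```python
import numpy as np

def forced_align(log_likes, phone_seq):
    """Viterbi forced alignment (toy: one HMM state per phone).

    log_likes : array [T, P] of per-frame log-likelihoods for each of
                P phone models (assumed precomputed).
    phone_seq : known phone sequence of the utterance, as model indices.
    Returns the first frame index of each phone in the sequence.
    """
    T = log_likes.shape[0]
    S = len(phone_seq)
    NEG = -np.inf
    delta = np.full((T, S), NEG)          # best cumulative log-likelihood
    back = np.zeros((T, S), dtype=int)    # best predecessor state
    delta[0, 0] = log_likes[0, phone_seq[0]]
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s]                      # remain in phone s
            move = delta[t - 1, s - 1] if s > 0 else NEG  # advance from s-1
            if move > stay:
                delta[t, s] = move + log_likes[t, phone_seq[s]]
                back[t, s] = s - 1
            else:
                delta[t, s] = stay + log_likes[t, phone_seq[s]]
                back[t, s] = s
    # Backtrack to recover which frames belong to which phone.
    states = np.zeros(T, dtype=int)
    states[-1] = S - 1
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    # Phone boundaries: first frame of each phone in the sequence.
    return [0] + [t for t in range(1, T) if states[t] != states[t - 1]]

# Toy utterance: 6 frames, phone 0 likely in frames 0-2, phone 1 in 3-5.
log_likes = np.log(np.array([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3))
print(forced_align(log_likes, [0, 1]))  # -> [0, 3]
```

In a real aligner, the per-frame scores would come from corpus-trained HMM-GMM monophone models (e.g. flat-start initialisation followed by embedded re-estimation), and each phone would use a multi-state left-to-right topology rather than a single state.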
