Using automatic alignment to analyze endangered language data: testing the viability of untrained alignment.

While efforts to document endangered languages have steadily increased, the phonetic analysis of endangered language data remains a challenge. Transcribing large documentation corpora is, by itself, a tremendous feat, yet segmentation remains a bottleneck for research with data of this kind. This paper examines whether a speech processing tool, forced alignment, can facilitate the segmentation task for small data sets, even when the target language differs from the training language. It also examines whether a contextualized phone set, one with more allophonic detail, outperforms a more general one. The accuracy of two forced aligners trained on English (hmalign and p2fa) was assessed using corpus data from Yoloxóchitl Mixtec. Agreement was relatively good, with accuracy at 70.9% within 30 ms for hmalign and 65.7% within 30 ms for p2fa. Segmental and tonal categories also influenced accuracy. For instance, additional stop allophones in hmalign's phone set aided alignment accuracy. Agreement differences between the aligners also corresponded closely with the types of data on which each aligner was trained. Overall, existing alignment systems were found to have potential for making phonetic analysis of small corpora more efficient, with more allophonic phone sets providing better agreement than general ones.
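As an illustration of what "accuracy within 30 ms" can mean, the sketch below computes the share of aligner boundaries that fall within a tolerance of a reference boundary. It is a minimal example under stated assumptions, not the paper's evaluation procedure: the function name `agreement_within`, the use of a manual segmentation as the reference, the one-to-one pairing of boundaries, and the sample times are all hypothetical.

```python
def agreement_within(reference, automatic, tolerance_s=0.030):
    """Return the fraction of boundary pairs that differ by at most tolerance_s.

    Assumes both lists hold boundary times in seconds and are already paired
    one-to-one (same segments, same order). This pairing is an assumption of
    the sketch, not something the abstract specifies.
    """
    if len(reference) != len(automatic):
        raise ValueError("boundary lists must have the same length")
    if not reference:
        raise ValueError("boundary lists must not be empty")
    hits = sum(
        1 for ref, auto in zip(reference, automatic)
        if abs(ref - auto) <= tolerance_s
    )
    return hits / len(reference)


# Hypothetical boundary times in seconds (not data from the study).
manual_boundaries = [0.100, 0.250, 0.400, 0.610]
aligner_boundaries = [0.110, 0.290, 0.395, 0.600]

rate = agreement_within(manual_boundaries, aligner_boundaries)
print(f"{rate:.1%} of boundaries within 30 ms")
```

In this toy input, three of the four boundaries differ by 10 ms or less and one differs by 40 ms, so the agreement rate is 75%. The same function could be run per segment class (for example, stops versus vowels) to compare categories, again assuming paired boundaries are available.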
