Long audio alignment for automatic subtitling using different phone-relatedness measures

This work presents long audio alignment systems for Spanish and English in an automatic subtitling scenario. Pre-recorded content is automatically recognized at the phoneme level by language-dependent phone decoders. A dynamic-programming alignment algorithm then finds matches between the automatically decoded phones and those in the phonetic transcription of the content's script. The accuracy of the alignment algorithm is evaluated with three non-binary scoring matrices, based respectively on phone confusion pairs from each phone decoder, on phonological similarity, and on human perception errors. Alignment results with these three continuous-score matrices are compared, at word and subtitle levels, to results with a baseline binary matrix. The non-binary matrices achieved clearly better results than the binary baseline. Matrix samples are available on the project's website.
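To make the role of the scoring matrix concrete, the following is a minimal sketch of a global dynamic-programming alignment between a decoded phone sequence and a reference phone sequence. It is a generic Needleman-Wunsch-style formulation and is not necessarily the algorithm used in this work. The names `align`, `binary_score`, `graded_score` and `SIMILAR`, the gap penalty, and all similarity values are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]

def binary_score(a: str, b: str) -> float:
    """Baseline binary matrix: a phone pair either matches or it does not."""
    return 1.0 if a == b else -1.0

# Hypothetical graded similarities for a few phone pairs (illustrative values,
# not taken from the matrices described in the abstract).
SIMILAR = {frozenset(("p", "b")): 0.5, frozenset(("m", "n")): 0.25}

def graded_score(a: str, b: str) -> float:
    """Non-binary matrix: related phones get partial credit instead of a full mismatch."""
    if a == b:
        return 1.0
    return SIMILAR.get(frozenset((a, b)), -1.0)

def align(decoded: List[str], reference: List[str],
          score: Callable[[str, str], float],
          gap: float = -1.0) -> Tuple[float, List[Pair]]:
    """Globally align two phone sequences; return the total score and the aligned pairs."""
    n, m = len(decoded), len(reference)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Borders: aligning a prefix against nothing costs one gap per phone.
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(decoded[i - 1], reference[j - 1]),
                dp[i - 1][j] + gap,   # decoded phone with no reference counterpart
                dp[i][j - 1] + gap,   # reference phone missed by the decoder
            )
    # Trace back from the bottom-right cell to recover the aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + score(decoded[i - 1], reference[j - 1])):
            pairs.append((decoded[i - 1], reference[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            pairs.append((decoded[i - 1], None))
            i -= 1
        else:
            pairs.append((None, reference[j - 1]))
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs

if __name__ == "__main__":
    decoded, reference = ["b", "a", "n"], ["p", "a", "m"]
    print(align(decoded, reference, binary_score))
    print(align(decoded, reference, graded_score))
```

In this toy case both scoring functions yield the same pairing, but the graded function assigns it a higher total score because the decoder's substitutions are between related phones. On long recordings, that partial credit is what can let an aligner prefer a plausible path through noisy decoder output over an arbitrary one.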
