Text spotting in large speech databases for under-resourced languages

Lightly supervised acoustic modeling for under-resourced languages raises new issues, because Automatic Speech Recognition (ASR) systems for such languages have poor accuracy and the available speech transcriptions are often of low quality. Under these conditions, common alignment techniques cannot always align the ASR output with the approximate transcription. We propose two alignment methods that overcome these issues. The first applies an image processing algorithm to the matching matrix of the two texts to be aligned; the second is based on segmental Dynamic Time Warping (DTW). Both approaches outperform the standard DTW technique currently in use, extracting on average 29% and 27% more speech data, respectively.
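To make the notion of a matching matrix concrete, the following is a minimal sketch, not the authors' algorithm. It assumes that both texts are already tokenized into words and that a cell of the matrix is set when the ASR word and the transcript word are identical. The abstract does not specify the image processing step, so a plain search for diagonal runs stands in for it here. The names `matching_matrix`, `diagonal_islands` and `min_len` are hypothetical.

```python
def matching_matrix(hyp, ref):
    """Return a matrix where cell [i][j] is True when hyp[i] equals ref[j]."""
    return [[h == r for r in ref] for h in hyp]


def diagonal_islands(matrix, min_len=3):
    """Find diagonal runs of matches, returned as (hyp_start, ref_start, length).

    Assumption: a run of consecutive matching words shows up as a diagonal
    segment in the matrix; runs shorter than min_len are treated as noise.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    islands = []
    for i in range(n_rows):
        for j in range(n_cols):
            if not matrix[i][j]:
                continue
            # Skip cells that continue a run started further up the diagonal.
            if i > 0 and j > 0 and matrix[i - 1][j - 1]:
                continue
            length = 0
            while (i + length < n_rows and j + length < n_cols
                   and matrix[i + length][j + length]):
                length += 1
            if length >= min_len:
                islands.append((i, j, length))
    return islands


# Toy example: noisy ASR output against an approximate transcription.
hyp = "uh the cat sat on the mat today".split()
ref = "yesterday the cat sat on a mat".split()
print(diagonal_islands(matching_matrix(hyp, ref)))  # [(1, 1, 4)]
```

In the toy example, the single island covers "the cat sat on", which starts at word 1 in both texts and is four words long. Isolated matches such as the repeated "the" are discarded by the length threshold.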
