Technique for automatic sentence-level alignment of long speech and transcripts

A frugal approach to constructing speech corpora, especially for resource-deficient languages, is to exploit collections of speech and corresponding text data available in audio books, news, and lectures. However, using these resources to build speech corpora requires aligning the long-duration speech data with the accompanying text data. Existing techniques for automatic speech-text alignment of long audio files assume the availability of a basic speech recognition engine and hence cannot be used directly for resource-deficient languages. In this paper, we propose a novel technique for sentence-level alignment of long speech-text data that exploits the syllable information in the speech and text data. The proposed technique does not depend on the availability of any speech recognition models and hence can be used for resource-deficient languages.
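The abstract does not spell out the alignment procedure, so the following is only an illustrative sketch of how syllable information from both sides might be combined, not the method of the paper. It assumes that syllable-nucleus timestamps have already been obtained from the audio by some acoustic detector, and that syllables in the text can be approximated by counting vowel groups (an English-like heuristic). The function names `count_syllables` and `align_sentences` are hypothetical.

```python
import re


def count_syllables(sentence):
    """Rough text-side syllable count: vowel groups per word (assumed heuristic)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(max(1, len(re.findall(r"[aeiouy]+", w))) for w in words)


def align_sentences(sentences, nuclei_times):
    """Estimate (sentence, start, end) spans from cumulative syllable counts.

    nuclei_times: sorted timestamps (seconds) of syllable nuclei detected in
    the audio. The detector itself is outside this sketch.
    """
    counts = [count_syllables(s) for s in sentences]
    total = sum(counts)
    n = len(nuclei_times)
    if total == 0 or n == 0:
        return []

    spans = []
    cumulative = 0
    start = nuclei_times[0]
    for sentence, c in zip(sentences, counts):
        cumulative += c
        # Scale the text syllable position onto the detected nuclei, so that
        # a mismatch between text and audio syllable counts is spread evenly.
        idx = min(n - 1, max(0, round(cumulative * n / total) - 1))
        end = nuclei_times[idx]
        spans.append((sentence, start, end))
        start = end
    return spans


if __name__ == "__main__":
    text = ["The cat sat.", "A dog ran away quickly."]
    nuclei = [0.2, 0.5, 0.8, 1.4, 1.7, 2.0, 2.3, 2.6, 2.9, 3.2]
    for sentence, start, end in align_sentences(text, nuclei):
        print(f"{start:.1f}-{end:.1f}s  {sentence}")
```

A proportional mapping like this needs no speech recognition models, which is the property the abstract emphasizes, but it would drift on long recordings; a practical system would need additional anchoring, for example at pauses.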
