An Automatic Real-time Synchronization of Live Speech with Its Transcription Approach

Most studies on the automatic synchronization of speech and transcription focus on synchronization at the sentence or phrase level. In some languages, such as Thai, however, boundaries at these levels are difficult to define linguistically, especially when synchronizing speech with its transcription. Synchronization at a finer level, such as the syllabic level, is therefore promising. In this article, we propose an approach that synchronizes live speech with its corresponding transcription in real time at the syllabic level. Our approach employs a modified version of the real-time syllable detection procedure from our previous work, followed by a transcription verification procedure that verifies correctness and recovers from errors caused by the real-time syllable detection procedure. In the experiments, the acoustic features and the parameters are tuned empirically. Results are compared with two baselines that have been applied to the Thai scenario. Experimental results indicate that our approach outperforms the two baselines, with error rate reductions of 75.9% and 41.9% respectively, and can also deliver results in real time. In addition, our approach is applied in a practical application, ChulaDAISY. Practical experiments show that ChulaDAISY with our approach can reduce the time needed to produce audio books.
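To make the idea of syllable-level, real-time synchronization concrete, the following is a minimal sketch and not the procedure evaluated in the article. It assumes a hypothetical detector that marks a syllable whenever per-frame energy rises above a fixed threshold, and it pairs each detection with the next syllable of the known transcription as frames arrive. The class name `SyllableSynchronizer`, the 10 ms frame length, and the threshold are illustrative assumptions; the transcription verification step described above is omitted.

```python
from typing import List, Optional, Tuple


class SyllableSynchronizer:
    """Toy streaming aligner (illustrative only, not the article's method).

    Assumption: a syllable starts when frame energy crosses `threshold`
    from below. Each detection is paired with the next transcript syllable.
    """

    def __init__(self, syllables: List[str], frame_ms: int = 10,
                 threshold: float = 0.5) -> None:
        self.syllables = syllables
        self.frame_ms = frame_ms      # assumed frame length in milliseconds
        self.threshold = threshold    # assumed energy threshold
        self._frame = 0               # index of the frame being processed
        self._next = 0                # index of the next unaligned syllable
        self._above = False           # was the previous frame above threshold?

    def push(self, energy: float) -> Optional[Tuple[str, int]]:
        """Consume one frame; return (syllable, onset_ms) on a detection."""
        onset_ms = self._frame * self.frame_ms
        self._frame += 1
        is_above = energy >= self.threshold
        rising = is_above and not self._above
        self._above = is_above
        if rising and self._next < len(self.syllables):
            pair = (self.syllables[self._next], onset_ms)
            self._next += 1
            return pair
        return None


def align_stream(syllables: List[str],
                 energies: List[float]) -> List[Tuple[str, int]]:
    """Feed frames one at a time, as a live stream would, and collect pairs."""
    sync = SyllableSynchronizer(syllables)
    pairs = []
    for e in energies:
        result = sync.push(e)
        if result is not None:
            pairs.append(result)
    return pairs
```

Because `push` handles one frame at a time and keeps only a few counters, each decision is available as soon as its frame arrives, which is the property a real-time setting needs. A detector this simple would miss or double-count syllables in practice, which is the kind of error a verification step is meant to catch.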
