Lightly supervised alignment of subtitles on multi-genre broadcasts

This paper describes a system for aligning subtitles to audio in multi-genre broadcasts using a lightly supervised approach. Accurate subtitle alignment plays a substantial role in the daily work of media companies and still requires considerable human effort. Here, a comprehensive approach to automating this task with lightly supervised alignment is proposed. The paper explores alternatives for speech segmentation, lightly supervised speech recognition and alignment of text streams. The proposed system uses lightly supervised decoding to improve alignment accuracy, adapting the language model to the target subtitles. The resulting system achieves the third best reported result in the alignment of broadcast subtitles in the Multi-Genre Broadcast (MGB) challenge, with an F1 score of 88.8%. The system is available for research and other non-commercial purposes through webASR, the University of Sheffield’s cloud-based speech technology web service. Taking an audio file and untimed subtitles as inputs, webASR can produce timed subtitles in multiple formats, including TTML, WebVTT and SRT.
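As an illustration of the final output step, the following minimal sketch shows how aligned subtitle cues might be serialised to the SRT format. It is not the webASR implementation: the function names `srt_timestamp` and `to_srt`, and the representation of a cue as a `(start, end, text)` tuple with times in seconds, are assumptions made for this example only.

```python
from typing import List, Tuple

# Hypothetical cue representation: (start_seconds, end_seconds, subtitle_text).
Cue = Tuple[float, float, str]


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: List[Cue]) -> str:
    """Serialise timed cues as SRT: numbered blocks separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Example cues with made-up timings, standing in for aligner output.
    print(to_srt([(0.0, 1.25, "Hello"), (1.5, 3.0, "world")]))
```

Each SRT block consists of a running index, a timing line and the subtitle text. WebVTT and TTML carry the same start and end times in different syntax, so the same cue list could feed writers for those formats as well.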
