APyCA: Towards the automatic subtitling of television content in Spanish

Automatic subtitling of television content has become an approachable challenge thanks to advances in the underlying technology. It has also become a pressing need for many Spanish TV broadcasters, who must subtitle up to 90% of their content by 2013 to comply with recently approved national audiovisual legislation. APyCA, the prototype system described in this paper, was developed to automate the subtitling of television content in Spanish by applying state-of-the-art speech and language technologies. Voice activity detection, automatic speech recognition and alignment, discourse segment detection and speaker diarization have proved useful for generating time-coded, colour-assigned draft transcriptions for post-editing. The productivity gain of this approach depends heavily on the performance of the speech recognition module, which achieves reasonable results on clean read speech but degrades as the speech becomes noisier and/or more spontaneous.
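The first stage of the pipeline described above, voice activity detection, can be illustrated with a minimal energy-based sketch. This is a toy example, not the VAD module actually used in APyCA (which the literature typically bases on statistical models of the spectrum); the frame length, decision margin, and synthetic signal below are all assumptions made for illustration.

```python
import math

# Illustrative energy-based voice activity detection (VAD) sketch.
# NOT the APyCA system's actual VAD: frame_len, margin and the
# synthetic test signal are assumptions chosen for this example.

def frame_energies(samples, frame_len=160):
    """Split samples into non-overlapping frames; return per-frame log energy."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        energies.append(math.log(energy + 1e-10))  # floor avoids log(0)
    return energies

def detect_speech(samples, frame_len=160, margin=3.0):
    """Flag frames whose log energy exceeds the noise floor by `margin`."""
    energies = frame_energies(samples, frame_len)
    noise_floor = min(energies)  # assume quietest frame is background noise
    return [e > noise_floor + margin for e in energies]

# Synthetic signal: low-amplitude "silence" followed by a louder "speech" burst.
silence = [0.001 * math.sin(0.1 * n) for n in range(800)]
speech = [0.5 * math.sin(0.3 * n) for n in range(800)]
flags = detect_speech(silence + speech)
```

In a real broadcast-subtitling pipeline, the speech/non-speech boundaries produced at this stage would delimit the segments passed on to the recogniser and diarizer and would anchor the subtitles' time codes.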
