A bilingual comparison of MaxEnt- and RNN-based punctuation restoration in speech transcripts

Closed captioning is a common method for improving the accessibility of TV programs for people who are deaf or hard of hearing, and it is also an application relevant to cognitive infocommunications. However, live captions produced by automatic speech recognition systems usually lack punctuation, making them hard to follow. In this paper, Maximum Entropy (MaxEnt) and Recurrent Neural Network (RNN) based punctuation restoration models are compared on two closed captioning tasks in real-time and offline setups. We present the first results on punctuation restoration for Hungarian broadcast speech, where the RNN significantly outperforms our MaxEnt baseline. Our approach is also evaluated on TED talks from the IWSLT English dataset, yielding results comparable to state-of-the-art systems.
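Both model families described in the abstract treat punctuation restoration as a sequence-labeling problem: for each word in the unpunctuated transcript, the model predicts the punctuation mark (if any) that should follow it. As a minimal, hedged illustration (the function name and label set below are illustrative, not taken from the paper), punctuated training text can be converted into per-word labels like this:

```python
# Sketch: framing punctuation restoration as per-word sequence labeling.
# The label inventory (COMMA, PERIOD, QUESTION, NONE) is a common choice
# in the literature; the exact set used in the paper may differ.

PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def to_word_label_pairs(text):
    """Turn punctuated text into (word, following-punctuation) pairs."""
    pairs = []
    for tok in text.split():
        if len(tok) > 1 and tok[-1] in PUNCT_LABELS:
            # Strip the trailing mark and record it as the word's label.
            pairs.append((tok[:-1].lower(), PUNCT_LABELS[tok[-1]]))
        else:
            pairs.append((tok.lower(), "NONE"))
    return pairs

print(to_word_label_pairs("Hello, how are you today?"))
# → [('hello', 'COMMA'), ('how', 'NONE'), ('are', 'NONE'),
#    ('you', 'NONE'), ('today', 'QUESTION')]
```

A MaxEnt tagger would predict each label from local word features, while an RNN (e.g. a bidirectional LSTM) conditions each prediction on the full word sequence.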
