Punctuated transcription of multi-genre broadcasts using acoustic and lexical approaches

In this paper we investigate the punctuated transcription of multi-genre broadcast media. We examine four systems: three based on lexical features, and a fourth that uses acoustic features by integrating punctuation into the speech recognition acoustic models. We also explore the combination of these component systems using voting and log-linear interpolation. We performed experiments on the English-language MGB Challenge data, which comprise about 1,600 hours of BBC television recordings. Our results indicate that a lexical system based on a neural machine translation approach significantly outperforms the other systems, achieving an F-measure of 62.6% on reference text, with a relative degradation of 19% on ASR output. Analysing the results by punctuation mark, we find that longer context improves the prediction of question marks, while acoustic information improves the prediction of exclamation marks. Finally, we show that although the systems are complementary, their straightforward combination does not yield a better F-measure than the single neural machine translation system.
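To make the two combination strategies concrete, below is a minimal sketch, assuming each component system outputs a posterior distribution over punctuation labels at every word boundary. The label set, interpolation weights, and function names are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the two combination strategies mentioned above: majority voting
# over hard decisions, and log-linear interpolation of posteriors.
# All labels, weights, and example numbers are hypothetical.
import math
from collections import Counter

LABELS = ["NONE", ",", ".", "?", "!"]  # assumed punctuation label set

def vote(predictions):
    """Majority vote over the component systems' hard label decisions."""
    counts = Counter(predictions)
    label, _ = counts.most_common(1)[0]
    return label

def log_linear(posteriors, weights):
    """Log-linearly interpolate per-boundary posteriors and return the argmax.

    posteriors: list of dicts, one per system, mapping label -> probability.
    weights:    one interpolation weight per system (hypothetical values).
    """
    scores = {}
    for label in LABELS:
        scores[label] = sum(
            w * math.log(max(p.get(label, 0.0), 1e-12))
            for w, p in zip(weights, posteriors)
        )
    # Normalise back to a probability distribution and pick the best label.
    z = math.log(sum(math.exp(s) for s in scores.values()))
    probs = {lab: math.exp(s - z) for lab, s in scores.items()}
    return max(probs, key=probs.get)

# Example: three systems disagree about the label at one word boundary.
posteriors = [
    {"NONE": 0.10, ",": 0.20, ".": 0.60, "?": 0.05, "!": 0.05},
    {"NONE": 0.20, ",": 0.50, ".": 0.20, "?": 0.05, "!": 0.05},
    {"NONE": 0.10, ",": 0.30, ".": 0.50, "?": 0.05, "!": 0.05},
]
hard_decisions = [max(p, key=p.get) for p in posteriors]
print(vote(hard_decisions))                          # simple majority vote
print(log_linear(posteriors, [0.4, 0.3, 0.3]))       # weighted log-linear combination
```

In practice the interpolation weights would be tuned on held-out data, and the combined label sequence would then be scored against the reference punctuation with an F-measure, as in the experiments described above.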
