Improved models for automatic punctuation prediction for spoken and written text

This paper presents improved models for the automatic prediction of punctuation marks in written or spoken text. Various kinds of textual features are combined using Conditional Random Fields. These features include language model scores, token n-grams, sentence length, and syntactic information extracted from parse trees. The resulting models are evaluated on several different tasks, ranging from formal newspaper text to informal, dictated messages and documents, and from written text to spoken text. The newly developed models outperform a hidden-event language model by up to 26% relative in F-score. Evaluating punctuation prediction on erroneous ASR output, or against multiple references, is not straightforward; we propose modifications of existing evaluation methods to handle these cases.
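As a rough illustration of the setup described above, the following Python sketch treats punctuation prediction as labeling the slot after each token, builds a few n-gram and position features for one slot, and computes an F-score and a relative improvement. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, feature names, label set (`O`, `COMMA`, `PERIOD`, `QUESTION`), and the micro-averaged scoring are all hypothetical, and language model scores and parse-tree features are omitted.

```python
NO_PUNCT = "O"  # assumed label for "no punctuation after this token"


def token_features(tokens, i):
    """Feature dict for the slot after tokens[i].

    Feature names are hypothetical; a CRF toolkit would consume one
    such dict per token position.
    """
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    cur = tokens[i].lower()
    return {
        "unigram": cur,
        "bigram_prev": f"{prev_tok}_{cur}",
        "bigram_next": f"{cur}_{next_tok}",
        # Crude stand-in for a length feature: tokens seen so far.
        "tokens_so_far": i + 1,
    }


def f_score(reference, hypothesis):
    """Micro-averaged F-score over punctuation slots.

    A slot counts as correct only if both label sequences carry the
    same punctuation mark; slots labeled NO_PUNCT are not rewarded.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have equal length")
    correct = sum(
        1 for r, h in zip(reference, hypothesis) if r == h and r != NO_PUNCT
    )
    n_hyp = sum(1 for h in hypothesis if h != NO_PUNCT)
    n_ref = sum(1 for r in reference if r != NO_PUNCT)
    precision = correct / n_hyp if n_hyp else 0.0
    recall = correct / n_ref if n_ref else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, e.g. 0.26 for 26%."""
    return (improved - baseline) / baseline


if __name__ == "__main__":
    tokens = ["hello", "world", "how", "are", "you"]
    reference = ["COMMA", "PERIOD", "O", "O", "QUESTION"]
    hypothesis = ["COMMA", "O", "O", "O", "PERIOD"]
    print(token_features(tokens, 0))
    print(f_score(reference, hypothesis))
```

This scoring assumes the hypothesis and reference share the same token sequence and a single reference, so it does not cover the erroneous-ASR or multiple-reference cases mentioned in the abstract.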
