Recovering Capitalization and Punctuation Marks on Speech Transcriptions

This work addresses two metadata annotation tasks involved in the production of rich transcripts: automatic capitalization and punctuation mark recovery. The main focus is broadcast news, using both manual and automatic speech transcripts. Different capitalization models were analysed and compared, and the results support the idea that generative approaches capture the structure of written corpora better, while discriminative approaches are robust to ASR errors and suitable for dealing with speech transcripts. Language dynamics was also addressed, and the results indicate that capitalization performance is affected by the temporal distance between the training and testing data. For the punctuation task, this study covers the three most frequent marks (full stop, comma, and question mark), combining lexical, acoustic, and prosodic information. Much of the research described here is language independent, but special focus is given to the Portuguese language. This work provides the first evaluation results for these two tasks on European Portuguese broadcast news data.
