Punctuation Prediction in Spontaneous Conversations: Can We Mitigate ASR Errors with Retrofitted Word Embeddings?

Automatic Speech Recognition (ASR) systems introduce word errors that often confuse punctuation prediction models, making punctuation restoration a challenging task. These errors usually take the form of homonyms. We show how retrofitting word embeddings on domain-specific data can mitigate ASR errors. Our main contribution is a method for better alignment of homonym embeddings, together with a validation of this method on the punctuation prediction task. We record an absolute improvement in punctuation prediction accuracy ranging from 6.2% (for question marks) to 9% (for periods) compared with the state-of-the-art model.
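
To make the idea of aligning homonym embeddings concrete, the following is a minimal sketch of one generic form of retrofitting, in which each vector is pulled toward the vectors of words it is linked to while staying close to its original position. It is an illustration only, not the method validated in this work: the abstract does not specify the retrofitting procedure, and the function `retrofit`, the weights `alpha` and `beta`, the toy vectors, and the linked word pairs are all assumptions introduced here.

```python
import math


def retrofit(embeddings, neighbours, alpha=1.0, beta=1.0, iterations=10):
    """Generic graph-style retrofitting (illustrative assumption).

    Each word listed in `neighbours` is moved to a weighted average of its
    original vector (weight alpha) and the current vectors of its linked
    words (weight beta each). Words without links are left unchanged.
    """
    original = {word: list(vec) for word, vec in embeddings.items()}
    current = {word: list(vec) for word, vec in embeddings.items()}
    for _ in range(iterations):
        for word, related in neighbours.items():
            linked = [r for r in related if r in current]
            if word not in current or not linked:
                continue
            denom = alpha + beta * len(linked)
            current[word] = [
                (alpha * original[word][i]
                 + beta * sum(current[r][i] for r in linked)) / denom
                for i in range(len(original[word]))
            ]
    return current


def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy data: two words an ASR system might confuse, plus one
# unrelated word that has no links and should not move.
vectors = {"their": [1.0, 0.0], "there": [0.0, 1.0], "cat": [5.0, 5.0]}
pairs = {"their": ["there"], "there": ["their"]}

retrofitted = retrofit(vectors, pairs)
before = distance(vectors["their"], vectors["there"])
after = distance(retrofitted["their"], retrofitted["there"])
print(f"distance before: {before:.3f}, after: {after:.3f}")
```

In this toy setting the two linked vectors end up closer together than they started, which is the intuition behind making a downstream punctuation model less sensitive to one word being substituted for the other. How the links are obtained, and whether the update is driven by such links or by domain co-occurrence statistics, is left open here.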
