Robust Prediction of Punctuation and Truecasing for Medical ASR

Automatic speech recognition (ASR) systems in the medical domain that focus on transcribing clinical dictations and doctor-patient conversations often pose many challenges due to the complexity of the domain. ASR output typically undergoes automatic punctuation to enable users to speak naturally, without having to vocalise awkward and explicit punctuation commands, such as "period", "add comma" or "exclamation point", while truecasing enhances user readability and improves the performance of downstream NLP tasks. This paper proposes a conditional joint modeling framework for prediction of punctuation and truecasing using pretrained masked language models such as BERT, BioBERT and RoBERTa. We also present techniques for domain and task specific adaptation by fine-tuning masked language models with medical domain data. Finally, we improve the robustness of the model against common errors made in ASR by performing data augmentation. Experiments performed on dictation and conversational style corpora show that our proposed model achieves ~5% absolute improvement on ground truth text and ~10% improvement on ASR outputs over baseline models under F1 metric.

[1]  Richard M. Schwartz,et al.  The effects of speech recognition and punctuation on information extraction performance , 2005, INTERSPEECH.

[2]  Jacob Eisenstein,et al.  Unsupervised Domain Adaptation of Contextualized Embeddings for Sequence Labeling , 2019, EMNLP.

[3]  David Suendermann-Oeft,et al.  Deep Learning for Punctuation Restoration in Medical Reports , 2017, BioNLP.

[4]  Gökhan Tür,et al.  Automatic detection of sentence boundaries and disfluencies based on recognized words , 1998, ICSLP.

[5]  Yiming Yang,et al.  XLNet: Generalized Autoregressive Pretraining for Language Understanding , 2019, NeurIPS.

[6]  Michiel Bacchiani,et al.  Restoring punctuation and capitalization in transcribed speech , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Yaser Al-Onaizan,et al.  Neural Word Decomposition Models for Abusive Language Detection , 2019, ArXiv.

[8]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[9]  Markus Freitag,et al.  Modeling punctuation prediction as machine translation , 2011, IWSLT.

[10]  Yoshihiko Gotoh,et al.  Sentence Boundary Detection in Broadcast Speech Transcripts , 2000 .

[11]  Peter Bell,et al.  Sequence-to-sequence models for punctuated transcription combining lexical and acoustic features , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  John D. Lafferty,et al.  Cyberpunc: a lightweight punctuation annotation system for speech , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[13]  Geoffrey Zweig,et al.  Maximum entropy model for punctuation annotation from speech , 2002, INTERSPEECH.

[14]  Alexander M. Rush,et al.  Character-Aware Neural Language Models , 2015, AAAI.

[15]  Binh Nguyen,et al.  Fast and Accurate Capitalization and Punctuation for Automatic Speech Recognition Using Transformer and Chunk Merging , 2019, 2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA).

[16]  Jianhua Tao,et al.  Self-attention Based Model for Punctuation Prediction Using Word and Speech Embeddings , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Shachar Mirkin,et al.  Joint Learning of Correlated Sequence Labeling Tasks Using Bidirectional Recurrent Neural Networks , 2017, INTERSPEECH.

[18]  Mike Schuster,et al.  Japanese and Korean voice search , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Navdeep Jaitly,et al.  Speech recognition for medical conversations , 2017, INTERSPEECH.

[20]  Navdeep Jaitly,et al.  RNN Approaches to Text Normalization: A Challenge , 2016, ArXiv.

[21]  V. Silber-Varod,et al.  The effect of pitch, intensity and pause duration in punctuation detection , 2012, 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel.

[22]  Rico Sennrich,et al.  Neural Machine Translation of Rare Words with Subword Units , 2015, ACL.

[23]  Jan Niehues,et al.  Segmentation and punctuation prediction in speech language translation using a monolingual translation system , 2012, IWSLT.

[24]  Jan Niehues,et al.  Combination of NN and CRF models for joint detection of punctuation and disfluencies , 2015, INTERSPEECH.

[25]  Hwee Tou Ng,et al.  Better Punctuation Prediction with Dynamic Conditional Random Fields , 2010, EMNLP.

[26]  Heidi Christensen,et al.  Punctuation annotation using statistical prosody models. , 2001 .

[27]  Tanel Alumäe,et al.  Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration , 2016, INTERSPEECH.

[28]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[29]  Joris Driesen,et al.  Automated production of true-cased punctuated subtitles for weather and news broadcasts , 2014, INTERSPEECH.

[30]  David Suendermann-Oeft,et al.  Medical Speech Recognition: Reaching Parity with Humans , 2017, SPECOM.

[31]  Jaewoo Kang,et al.  BioBERT: a pre-trained biomedical language representation model for biomedical text mining , 2019, Bioinform..

[32]  Nicola Ueffing,et al.  Improved models for automatic punctuation prediction for spoken and written text , 2013, INTERSPEECH.

[33]  Tanel Alumäe,et al.  LSTM for punctuation restoration in speech transcripts , 2015, INTERSPEECH.

[34]  Omer Levy,et al.  RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.