Text Injection for Capitalization and Turn-Taking Prediction in Speech Models

Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements in word error rate. This study examines the use of text injection for auxiliary tasks, the non-ASR tasks often performed by an end-to-end (E2E) model. We use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model that performs two auxiliary tasks. The first is capitalization, a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. Our results show that this text injection method boosts capitalization performance on long-tail data and improves turn-taking detection recall.
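
As a rough illustration of what joint training on paired and text-only data can look like, the sketch below adds a weighted internal language model (ILM) loss, computed on unpaired text, to an E2E loss computed on paired audio-text data. The abstract does not give the loss terms or their weighting, so the function name `joint_loss`, the parameter `ilm_weight`, and its default value are hypothetical and chosen only for this example.

```python
def joint_loss(paired_e2e_loss: float,
               text_ilm_loss: float,
               ilm_weight: float = 0.5) -> float:
    """Combine an E2E loss with a weighted text-only ILM loss.

    Assumption (not from the paper): the two terms are summed, with a
    single non-negative scalar weight on the text-only term.
    """
    if ilm_weight < 0:
        raise ValueError("ilm_weight must be non-negative")
    # Paired audio-text data drives the E2E term; unpaired text drives
    # the ILM term, which is how text-only data enters training.
    return paired_e2e_loss + ilm_weight * text_ilm_loss


# Example: an E2E loss of 2.0 and an ILM loss of 4.0 with weight 0.5.
total = joint_loss(2.0, 4.0, ilm_weight=0.5)
```

Setting `ilm_weight` to zero in this sketch recovers training on paired data alone, which makes the contribution of the injected text easy to isolate.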
