Text Injection for Capitalization and Turn-Taking Prediction in Speech Models

Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements in word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an end-to-end (E2E) model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model that performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. Our results show that this text injection method boosts capitalization performance for long-tail data and improves turn-taking detection recall.
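The abstract does not state the training objective. As a rough illustration only, joint training on paired and unpaired data is often written as a weighted sum of an E2E loss computed on paired audio-text batches and an internal language model (ILM) loss computed on text-only batches. The function name `joint_loss`, the parameter `ilm_weight`, and its default value are hypothetical and are not taken from the paper.

```python
def joint_loss(e2e_loss_paired: float,
               ilm_loss_text: float,
               ilm_weight: float = 0.2) -> float:
    """Combine two scalar losses into one training objective.

    e2e_loss_paired: loss on a paired audio-text batch (assumed given).
    ilm_loss_text:   internal-LM loss on an unpaired text batch (assumed given).
    ilm_weight:      hypothetical interpolation weight for the text-only term.
    """
    if ilm_weight < 0:
        raise ValueError("ilm_weight must be non-negative")
    # Weighted sum: the text-only term supplements the paired-data term.
    return e2e_loss_paired + ilm_weight * ilm_loss_text


# Example: a paired-batch loss of 2.0 and a text-batch loss of 1.0.
total = joint_loss(2.0, 1.0, ilm_weight=0.5)
```

Setting `ilm_weight` to zero in this sketch recovers training on paired data alone, which gives a simple baseline to compare against.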
