Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks using Switching Tokens

In this paper, we propose a novel spoken-text-style conversion method that can execute multiple conversion tasks, such as punctuation restoration and disfluency deletion, simultaneously without preparing matched datasets. In practice, transcriptions generated by automatic speech recognition systems are not highly readable because they often contain many disfluencies and lack punctuation marks. To improve readability, spoken-text-style conversion modules that each model a single conversion task are typically cascaded, because matched datasets that cover multiple conversion tasks at once are often unavailable. However, cascading is sensitive to the order of the tasks because conversion errors propagate through the chain. Moreover, cascading incurs a higher computational cost than a single conversion. To execute multiple conversion tasks simultaneously without matched datasets, our key idea is to distinguish the individual conversion tasks with on-off switches. In our proposed zero-shot joint modeling, multiple switching tokens control which tasks are active, enabling a zero-shot learning approach in which combinations of conversions never seen together during training can be executed simultaneously. Our experiments on the joint modeling of disfluency deletion and punctuation restoration demonstrate the effectiveness of our method.
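To make the switching-token idea concrete, the following is a minimal sketch, not the authors' implementation: one switching token per task is prepended to the source sequence fed to a single sequence-to-sequence model, single-task corpora are used for training with only that task's switch turned on, and at inference several switches can be turned on at once. The token names <DD_ON>/<DD_OFF> and <PR_ON>/<PR_OFF> and the function below are hypothetical.

    # Minimal sketch (not the authors' code) of conditioning a single
    # seq2seq model on the active conversion tasks via switching tokens.
    # Token names <DD_ON>, <DD_OFF>, <PR_ON>, <PR_OFF> are hypothetical.

    def build_source(tokens, disfluency_deletion, punctuation_restoration):
        """Prepend one switching token per task to the ASR transcript."""
        switches = [
            "<DD_ON>" if disfluency_deletion else "<DD_OFF>",
            "<PR_ON>" if punctuation_restoration else "<PR_OFF>",
        ]
        return switches + tokens

    # Training uses only single-task corpora (no matched dataset needed):
    #   disfluency-deletion pairs    -> <DD_ON> <PR_OFF> + source tokens
    #   punctuation-restoration pairs -> <DD_OFF> <PR_ON> + source tokens
    # At inference, turning both switches on requests a joint conversion
    # the model never saw during training (the zero-shot combination).
    print(build_source(["well", "uh", "i", "think", "so"], True, True))
    # ['<DD_ON>', '<PR_ON>', 'well', 'uh', 'i', 'think', 'so']

Because each switch is an ordinary vocabulary token, no architectural change to the underlying sequence-to-sequence model is required; the combination of tasks is expressed entirely in the input.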
