Personalizing ASR with limited data using targeted subset selection

We study the task of personalizing ASR models to a target non-native speaker or accent under a transcription budget on the total duration of utterances selected from a large unlabelled corpus. We propose a subset selection approach based on the recently proposed submodular mutual information functions: given a few target utterances, we model the relationship between this target set and the selected subset using these functions, and thereby identify a diverse set of utterances that match the target speaker or accent. The method is applied at both the speaker and accent levels. We personalize the model by fine-tuning it on the utterances selected and transcribed from the unlabelled corpus. Our method consistently identifies utterances from the target speaker or accent using speech features alone. The targeted subset selection approach improves upon random sampling by as much as 2% to 5% (absolute), depending on the speaker and accent, and is 2x to 4x more label-efficient than random sampling. We also compare against a skyline that selects directly from the target, and our method generally outperforms this oracle in its selections.
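The selection step described above can be sketched as budget-constrained greedy maximization of a submodular mutual information function between the unlabelled pool and the few target (query) utterances. The sketch below uses the facility-location variant over precomputed similarity scores; the function name `greedy_flqmi`, the similarity matrix, and the trade-off parameter `eta` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def greedy_flqmi(sim, durations, budget, eta=1.0):
    """Greedy budget-constrained maximization of a facility-location
    submodular mutual information:
        I(A; Q) = sum_i max_{j in A} sim[i, j] + eta * sum_{j in A} max_i sim[i, j]

    sim       : (num_targets, num_pool) similarity scores
    durations : per-utterance durations (same cost unit as budget)
    budget    : total transcription budget
    """
    n_targets, n_pool = sim.shape
    cover = np.zeros(n_targets)        # best similarity to each target so far
    selected, total = [], 0.0
    remaining = list(range(n_pool))
    while remaining:
        best_j, best_gain = None, 0.0
        for j in remaining:
            if total + durations[j] > budget:
                continue
            # marginal gain: improved coverage of targets + relevance of j
            gain = (np.maximum(cover, sim[:, j]) - cover).sum() + eta * sim[:, j].max()
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:             # budget exhausted or no positive gain
            break
        selected.append(best_j)
        total += durations[best_j]
        cover = np.maximum(cover, sim[:, best_j])
        remaining.remove(best_j)
    return selected
```

With similarities derived from, say, cosine scores between speaker or acoustic embeddings of the query and pool utterances, the first term saturates as targets become covered (encouraging diversity) while the second rewards relevance to the target, and selection stops once the duration budget is spent. The greedy rule enjoys the usual constant-factor approximation guarantees for monotone submodular maximization.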
