Word Order Does Not Matter for Speech Recognition

In this paper, we study training an automatic speech recognition (ASR) system in a weakly supervised setting, where the order of words in the transcript labels of the audio training data is not known. We first train a word-level acoustic model that aggregates the distributions of all output frames using a LogSumExp operation and uses a cross-entropy loss to match the aggregate against the ground-truth word distribution. Using pseudo-labels generated by this model on the training set, we then train a letter-based acoustic model with the Connectionist Temporal Classification (CTC) loss. Our system achieves 2.3%/4.6% word error rate on the test-clean/test-other subsets of LibriSpeech, closely matching the performance of the supervised baseline.
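The order-free objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes frame-level logits over a word vocabulary, pools the per-frame log-softmax scores across time with LogSumExp, renormalizes the pooled scores into a distribution, and takes the cross-entropy against the bag-of-words distribution of the (unordered) transcript. All function and variable names are hypothetical.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def bag_of_words_loss(frame_logits, target_counts):
    """Order-free loss sketch: pool per-frame word log-probabilities
    with LogSumExp across time, renormalize, and compute cross-entropy
    against the transcript's word-count distribution."""
    # Per-frame log-softmax over the word vocabulary.
    # frame_logits: T x V (list of per-frame score lists).
    log_probs = []
    for frame in frame_logits:
        z = logsumexp(frame)
        log_probs.append([x - z for x in frame])
    vocab = len(frame_logits[0])
    # LogSumExp pooling across the T frames: one score per word.
    pooled = [logsumexp([lp[v] for lp in log_probs]) for v in range(vocab)]
    # Renormalize pooled scores into a log-distribution over words.
    z = logsumexp(pooled)
    pooled_log_dist = [p - z for p in pooled]
    # Ground-truth word distribution: counts only, no word order used.
    total = sum(target_counts)
    target = [c / total for c in target_counts]
    # Cross-entropy between the target and pooled prediction.
    return -sum(t * lp for t, lp in zip(target, pooled_log_dist))
```

Because the loss depends on frame outputs only through the time-pooled distribution, permuting the words in the transcript leaves the target counts, and hence the loss, unchanged.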
