Word Order Does Not Matter for Speech Recognition

In this paper, we study training an automatic speech recognition (ASR) system in a weakly supervised setting, where the order of words in the transcript labels of the audio training data is not known. We first train a word-level acoustic model that aggregates the distributions of all output frames using a LogSumExp operation and uses a cross-entropy loss to match the aggregate against the ground-truth word distribution. Using pseudo-labels generated by this model on the training set, we then train a letter-based acoustic model with the Connectionist Temporal Classification (CTC) loss. Our system achieves 2.3%/4.6% word error rate on the test-clean/test-other subsets of LibriSpeech, closely matching the performance of the supervised baseline.
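The order-free objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes frame-level logits over a word vocabulary, pools the per-frame log-softmax scores across time with LogSumExp, renormalizes the pooled scores into a distribution, and takes the cross-entropy against the bag-of-words distribution of the (unordered) transcript. All function and variable names are hypothetical.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def bag_of_words_loss(frame_logits, target_counts):
    """Order-free loss sketch: pool per-frame word log-probabilities
    with LogSumExp across time, renormalize, and compute cross-entropy
    against the transcript's word-count distribution."""
    # Per-frame log-softmax over the word vocabulary.
    # frame_logits: T x V (list of per-frame score lists).
    log_probs = []
    for frame in frame_logits:
        z = logsumexp(frame)
        log_probs.append([x - z for x in frame])
    vocab = len(frame_logits[0])
    # LogSumExp pooling across the T frames: one score per word.
    pooled = [logsumexp([lp[v] for lp in log_probs]) for v in range(vocab)]
    # Renormalize pooled scores into a log-distribution over words.
    z = logsumexp(pooled)
    pooled_log_dist = [p - z for p in pooled]
    # Ground-truth word distribution: counts only, no word order used.
    total = sum(target_counts)
    target = [c / total for c in target_counts]
    # Cross-entropy between the target and pooled prediction.
    return -sum(t * lp for t, lp in zip(target, pooled_log_dist))
```

Because the loss depends on frame outputs only through the time-pooled distribution, permuting the words in the transcript leaves the target counts, and hence the loss, unchanged.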
