Libri-Light: A Benchmark for ASR with Limited or No Supervision

We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID, and genre descriptions. Additionally, we provide baseline systems and evaluation metrics operating under three settings: (1) the zero-resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER), and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) additionally uses large amounts of unaligned text. Both are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state of the art.
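As a point of reference for the evaluation metrics named above, the sketch below shows how error rates of the PER/CER/WER family are typically computed: a Levenshtein edit distance between the reference and hypothesis token sequences, normalized by the reference length. The token unit (phones, characters, or words) determines which metric you obtain. This is a generic illustration, not code from the Libri-Light release; the function and variable names are our own.

```python
# Minimal sketch (assumed, not from the paper) of edit-distance-based
# error rates: WER over words, CER over characters, PER over phones.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length (WER/CER/PER depending on tokens)."""
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)


# Example usage: WER on word tokens, CER on character tokens.
ref = "the cat sat on the mat"
hyp = "the cat sit on mat"
print(f"WER = {error_rate(ref.split(), hyp.split()):.2f}")
print(f"CER = {error_rate(list(ref), list(hyp)):.2f}")
```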
