论文信息 - vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a gumbel softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.

Michael Auli | Alexei Baevski | Steffen Schneider | Steffen Schneider

[1] Omer Levy,et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans , 2019, TACL.

[2] Rohit Prabhavalkar,et al. On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition , 2019, INTERSPEECH.

[3] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.

[4] Haizhou Li,et al. VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019 , 2019, INTERSPEECH.

[5] Julien Mairal,et al. Unsupervised Pre-Training of Image Features on Non-Curated Data , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6] Luke S. Zettlemoyer,et al. Transformers with convolutional context for ASR , 2019, ArXiv.

[7] Quoc V. Le,et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.

[8] Ewald van der Westhuizen,et al. Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks , 2019, INTERSPEECH.

[9] Ronan Collobert,et al. wav2vec: Unsupervised Pre-training for Speech Recognition , 2019, INTERSPEECH.

[10] Gabriel Synnaeve,et al. Who Needs Words? Lexicon-Free Speech Recognition , 2019, INTERSPEECH.

[11] Hao Tang,et al. An Unsupervised Autoregressive Model for Speech Representation Learning , 2019, INTERSPEECH.

[12] Myle Ott,et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling , 2019, NAACL.

[13] Gabriel Synnaeve,et al. A Fully Differentiable Beam Search Decoder , 2019, ICML.

[14] Ron J. Weiss,et al. Unsupervised Speech Representation Learning Using WaveNet Autoencoders , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[15] Sanjeev Khudanpur,et al. End-to-end Speech Recognition Using Lattice-free MMI , 2018, INTERSPEECH.

[16] Myle Ott,et al. Scaling Neural Machine Translation , 2018, WMT.

[17] James R. Glass,et al. Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech , 2018, INTERSPEECH.

[18] Yoshua Bengio,et al. Light Gated Recurrent Units for Speech Recognition , 2018, IEEE Transactions on Emerging Topics in Computational Intelligence.

[19] Iasonas Kokkinos,et al. Learning Filterbanks from Raw Speech for Phone Recognition , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20] Sanjeev Khudanpur,et al. Investigation of transfer learning for ASR using LF-MMI trained neural networks , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[21] Aren Jansen,et al. The zero resource speech challenge 2017 , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[22] Oriol Vinyals,et al. Neural Discrete Representation Learning , 2017, NIPS.

[23] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[24] Ben Poole,et al. Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.

[25] Chong Wang,et al. Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[26] Rico Sennrich,et al. Neural Machine Translation of Rare Words with Subword Units , 2015, ACL.

[27] Aren Jansen,et al. The Zero Resource Speech Challenge 2015: Proposed Approaches and Results , 2016, SLTU.

[28] Frank Hutter,et al. SGDR: Stochastic Gradient Descent with Restarts , 2016, ArXiv.

[29] Sanjeev Khudanpur,et al. Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30] Tom Minka,et al. A* Sampling , 2014, NIPS.

[31] Jeffrey Dean,et al. Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[32] Philipp Koehn,et al. Scalable Modified Kneser-Ney Language Model Estimation , 2013, ACL.

[33] Timothy B. Terriberry,et al. Definition of the Opus Audio Codec , 2012, RFC.

[34] Cordelia Schmid,et al. Product Quantization for Nearest Neighbor Search , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35] E. Gumbel. Statistical Theory of Extreme Values and Some Practical Applications : A Series of Lectures , 1954 .