SpeechBERT: An Audio-and-Text Jointly Learned Language Model for End-to-End Spoken Question Answering

While various end-to-end models for spoken language understanding tasks have been explored recently, this paper is, to our knowledge, the first attempt at the very difficult task of end-to-end spoken question answering (SQA). Inspired by the success of the BERT model on a wide range of text processing tasks, we propose SpeechBERT, an audio-and-text jointly learned language model. On datasets whose answer spans contain ASR errors, this model outperforms the conventional approach of cascading ASR with a text question answering (TQA) model, because the end-to-end model can extract information directly from the audio before ASR errors are introduced. Ensembling the proposed end-to-end model with the cascade architecture yields even better performance. Beyond its potential for end-to-end SQA, SpeechBERT may also serve many other spoken language understanding tasks, just as BERT does for many text processing tasks.
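
To make the architecture described above more concrete, here is a minimal sketch of how an audio-and-text jointly learned encoder for span-based SQA could be wired up. It is not the authors' implementation: the class name SpeechBERTSketch, the audio feature dimension, the use of a plain PyTorch Transformer encoder in place of a pretrained BERT, and the linear span heads are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class SpeechBERTSketch(nn.Module):
    """Illustrative sketch (not the paper's exact model) of joint audio-text SQA.

    The question is given as text token ids; the spoken passage is given as a
    sequence of pre-segmented audio features (assumed here: one vector per
    spoken word). Both are projected into a shared embedding space, encoded
    jointly by a Transformer, and two linear heads score each audio position
    as the start or end of the answer span.
    """

    def __init__(self, vocab_size=30522, d_model=768, audio_dim=240,
                 n_layers=12, n_heads=12):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Acoustic projection mapping audio-segment features into the same
        # space as the text token embeddings (learned jointly with the rest).
        self.audio_proj = nn.Linear(audio_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Span heads: one start score and one end score per position.
        self.span_head = nn.Linear(d_model, 2)

    def forward(self, question_ids, audio_feats):
        # question_ids: (batch, q_len) int64; audio_feats: (batch, a_len, audio_dim)
        q = self.token_emb(question_ids)             # (batch, q_len, d_model)
        a = self.audio_proj(audio_feats)             # (batch, a_len, d_model)
        x = torch.cat([q, a], dim=1)                 # joint text+audio sequence
        h = self.encoder(x)
        # Score only the audio positions; the answer span lies in the passage.
        audio_h = h[:, question_ids.size(1):, :]
        start_logits, end_logits = self.span_head(audio_h).unbind(dim=-1)
        return start_logits, end_logits


if __name__ == "__main__":
    # Toy forward pass with random inputs and a shallow encoder for speed.
    model = SpeechBERTSketch(n_layers=2)
    q = torch.randint(0, 30522, (2, 16))             # dummy question token ids
    a = torch.randn(2, 50, 240)                      # dummy audio-word features
    start, end = model(q, a)
    print(start.shape, end.shape)                    # torch.Size([2, 50]) twice
```

Training such a model with a cross-entropy loss over the start and end logits (against the ground-truth span positions) mirrors the standard BERT fine-tuning recipe for extractive QA; ensembling with a cascaded ASR+TQA system could then be done by combining the two systems' span scores.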
