End2End Acoustic to Semantic Transduction

In this paper, we propose a novel end-to-end sequence-to-sequence spoken language understanding model using an attention mechanism. It reliably selects contextual acoustic features in order to hypothesize semantic contents. An initial architecture capable of extracting all pronounced words and concepts from acoustic spans is designed and tested. With a shallow fusion language model, this system reaches a 13.6 concept error rate (CER) and an 18.5 concept value error rate (CVER) on the French MEDIA corpus, achieving an absolute 2.8 points reduction compared to the state-of-the-art. Then, an original model is proposed for hypothesizing concepts and their values. This transduction reaches a 15.4 CER and a 21.6 CVER without any new type of context.

[1]  Renato De Mori,et al.  ASR Error Management for Improving Spoken Language Understanding , 2017, INTERSPEECH.

[2]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[3]  Houfeng Wang,et al.  A Joint Model of Intent Determination and Slot Filling for Spoken Language Understanding , 2016, IJCAI.

[4]  Dennis H. Klatt,et al.  Review of the ARPA speech understanding project , 1990 .

[5]  Quoc V. Le,et al.  Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Srinivas Bangalore,et al.  Improved End-To-End Spoken Utterance Classification with a Self-Attention Acoustic Classifier , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Benoît Favre,et al.  Leveraging study of robustness and portability of spoken language understanding systems across languages and domains: the PORTMEDIA corpora , 2012, LREC.

[8]  Frédéric Béchet,et al.  DECODA: a call-centre human-human spoken conversation corpus , 2012, LREC.

[9]  James R. Glass,et al.  Improved Speech Representations with Multi-Target Autoregressive Predictive Coding , 2020, ACL.

[10]  Yongqiang Wang,et al.  Towards End-to-end Spoken Language Understanding , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Yoshua Bengio,et al.  End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results , 2014, ArXiv.

[12]  Frédéric Béchet,et al.  The French MEDIA/EVALDA Project: the Evaluation of the Understanding Capability of Spoken Language Dialogue Systems , 2004, LREC.

[13]  Pengwei Wang,et al.  Large-Scale Unsupervised Pre-Training for End-to-End Spoken Language Understanding , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Jian Chen,et al.  Fg2seq: Effectively Encoding Knowledge for End-To-End Task-Oriented Dialog , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  James Glass,et al.  CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning , 2020, ArXiv.

[16]  Bing Liu,et al.  Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling , 2016, INTERSPEECH.

[17]  Shinji Watanabe,et al.  End-to-end Speech Recognition With Word-Based Rnn Language Models , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[18]  David Suendermann-Oeft,et al.  Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[19]  Gökhan Tür,et al.  Multi-Domain Joint Semantic Frame Parsing Using Bi-Directional RNN-LSTM , 2016, INTERSPEECH.

[20]  Arun Narayanan,et al.  From Audio to Semantics: Approaches to End-to-End Spoken Language Understanding , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[21]  Renato De Mori,et al.  Dialogue History Integration into End-to-End Signal-to-Concept Spoken Language Understanding Systems , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22]  Renato De Mori,et al.  The Application of Semantic Classification Trees to Natural Language Understanding , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[23]  Yoshua Bengio,et al.  End-to-end attention-based large vocabulary speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Ron J. Weiss,et al.  Unsupervised Speech Representation Learning Using WaveNet Autoencoders , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[25]  Alexei Baevski,et al.  vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations , 2019, ICLR.

[26]  Yannick Estève,et al.  Curriculum-based transfer learning for an effective end-to-end spoken language understanding and domain portability , 2019, INTERSPEECH.

[27]  Yoshua Bengio,et al.  On Using Monolingual Corpora in Neural Machine Translation , 2015, ArXiv.

[28]  Yannick Estève,et al.  End-To-End Named Entity And Semantic Concept Extraction From Speech , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[29]  Nanyun Peng,et al.  Espresso: A Fast End-to-End Neural Speech Recognition Toolkit , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).