Réseaux de neurones profonds appliqués à la compréhension de la parole. (Deep learning applied to spoken language understanding)

This thesis is set against the emergence of deep learning and addresses spoken language understanding, defined here as the automatic extraction and representation of the meaning carried by the words of a spoken utterance. We study a semantic concept tagging task in a spoken dialogue setting, evaluated on the French MEDIA corpus. Over the past decade, neural models have gained the upper hand in many natural language processing tasks, thanks to algorithmic advances or to the availability of powerful computing hardware such as graphics processors. Many obstacles make understanding difficult. In particular, automatic speech transcriptions are hard to interpret because the speech recognition step upstream of the understanding module introduces many errors. We present a state of the art that first describes spoken language understanding and then the supervised machine learning methods used to address it, starting with classical systems and ending with deep learning techniques. The contributions are then presented along three axes. First, we develop an effective neural architecture consisting of a bidirectional recurrent encoder-decoder network with an attention mechanism. Second, we address the handling of automatic speech recognition errors and propose solutions to limit their impact on our performance. Finally, we consider a disambiguation of the understanding task that makes our system perform better.
