Prédiction de performance des systèmes de reconnaissance automatique de la parole à l’aide de réseaux de neurones convolutifs [Performance prediction of automatic speech recognition systems using convolutional neural networks]

ABSTRACT. This paper addresses the task of predicting the performance of automatic speech recognition (ASR) systems. We compare two prediction approaches: a state-of-the-art approach based on explicitly engineered features, and a new approach based on features learned implicitly by convolutional neural networks (CNNs). We then try to better understand which information the neural model captures and how it relates to different conditioning factors. To take advantage of this analysis, we propose a multi-task learning system that leverages the three types of information studied at training time, and which proves slightly more effective on the ASR performance prediction task.

KEYWORDS: performance prediction, large vocabulary continuous speech recognition, convolutional neural network.
