Harnessing Automatic Speech Transcription Systems

In this thesis, we address methods for combining large-vocabulary speech transcription systems. Our study focuses on harnessing heterogeneous transcription systems in order to improve transcription quality under a latency constraint. Statistical systems are affected by the many sources of variability that characterize the speech signal, and a single system is generally unable to model all of them. Combining different transcription systems rests on the idea of exploiting the strengths of each to obtain an improved final transcription. The combination methods proposed in the literature are mostly applied a posteriori, within a multi-pass transcription architecture. This incurs considerable latency, because of the waiting time required before the combination can be applied.

Recently, an integrated combination method has been proposed. It is based on the driven decoding paradigm (DDA: Driven Decoding Algorithm), which makes it possible to combine different systems during decoding. The method consists in integrating information from several so-called auxiliary systems into the decoding process of a so-called primary system.

Our contribution in this thesis is twofold. First, we study the robustness of combination by driven decoding, and then propose an improvement that generalizes efficiently, based on driven decoding with a bag of n-grams, called BONG. Second, we propose a framework for harnessing several single-pass systems to build the final recognition hypothesis collaboratively and at reduced latency. We present different theoretical models of the harnessing architecture and describe an example implementation using a distributed client/server architecture.

Having defined the collaboration architecture, we focus on combination methods suited to reduced-latency automatic transcription. We propose an adaptation of the BONG combination that allows several single-pass systems running in parallel to collaborate at reduced latency. We also present an adaptation of the ROVER combination that can be applied during decoding, through a local alignment process followed by a voting process based on word frequency. Both proposed combination methods reduce the latency of combining several single-pass systems while yielding a significant improvement in word error rate (WER).
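
The general idea of an alignment step followed by a word-frequency vote can be illustrated with a minimal Python sketch. It is not the implementation described in the thesis: the alignment here uses the standard-library `difflib.SequenceMatcher` as a stand-in for the local alignment process, the first hypothesis is arbitrarily taken as the alignment backbone, and the names `NULL`, `align_to_reference` and `vote` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

# Placeholder for "no word" in a slot (an assumption made for this sketch).
NULL = "@"


def align_to_reference(reference, hypothesis):
    """Map each reference position to the hypothesis word aligned with it, or NULL."""
    aligned = [NULL] * len(reference)
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            # Pair words position by position; surplus words on either side are ignored.
            for offset in range(min(i2 - i1, j2 - j1)):
                aligned[i1 + offset] = hypothesis[j1 + offset]
    return aligned


def vote(hypotheses):
    """Frequency vote over the slots defined by the first hypothesis."""
    reference = hypotheses[0]
    columns = [reference] + [align_to_reference(reference, h) for h in hypotheses[1:]]
    output = []
    for slot in zip(*columns):
        # The most frequent word wins; on a tie, the word seen first in the slot
        # is kept, which favours the first hypothesis.
        word, _count = Counter(slot).most_common(1)[0]
        if word != NULL:
            output.append(word)
    return output


if __name__ == "__main__":
    partial_outputs = [
        "the bat sat on the mat".split(),
        "the cat sat on the mat".split(),
        "the cat sat in the mat".split(),
    ]
    print(" ".join(vote(partial_outputs)))
```

In this toy example each system makes at most one error, and the majority vote recovers a word sequence that none of the erroneous systems produced on its own. In a reduced-latency setting, such a vote would presumably be run on short windows of partial hypotheses rather than on complete sentences.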
