Towards Real-Time Large-Vocabulary Automatic Speech Transcription

Automatically transcribing the speech contained in an audio stream is no longer a scientific utopia: current systems, generally based on Markov models, perform very well, and their use in demanding application contexts (automatic indexing, ...) is now conceivable. However, while off-line use is feasible, these systems are generally far too slow to be used in "real-time" application contexts such as automatic subtitling or translation, human-machine dialogue, ... The work carried out in this thesis therefore proposes methods for reducing the computation time of transcription systems so that they can be used in such contexts. We focused in particular on the computation of probabilities, a task that alone often accounts for more than half of the total processing time.

To evaluate the approaches developed, a reference recognition system must be implemented. We therefore built and improved a large-vocabulary transcription system, relying on the radio broadcast corpus distributed for the ESTER evaluation campaign.

The distributions of the acoustic models used by such systems are generally represented by mixtures of Gaussian components, and the cost of computing emission probabilities is closely tied to the number of Gaussians considered in these mixtures. Since only some of these Gaussians have a real impact on decoding, our work focused on evaluating Gaussian selection methods. In practice, these methods are based on classification. When the Gaussians of each mixture are grouped in a tree structure, traversing the tree from its root finds the leaf closest to the test data, and the Gaussian distributions located at that level are selected. Since this approach is not optimal, we proposed a hierarchical partitioning based on the similarity between distributions. Cutting the tree at different heights defines several levels of classification, each corresponding to a selection of Gaussians. The distributions finally chosen lie at the intersection of all the selections.
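The selection principle described above can be illustrated with a minimal sketch. It is not the system built in the thesis: it assumes diagonal-covariance Gaussians, a Euclidean distance to cluster centroids as the classification criterion, and a fallback to the full mixture when the intersection is empty. All names (`select_gaussians`, `levels`, the toy parameters) are hypothetical and chosen for illustration only.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def gmm_log_likelihood(x, weights, means, variances, selected=None):
    """Log emission probability of x under the mixture.

    If 'selected' is a non-empty set, only those components are evaluated.
    Falling back to all components otherwise is an assumption of this sketch.
    """
    indices = sorted(selected) if selected else range(len(weights))
    return log_sum_exp([
        math.log(weights[k]) + log_gauss_diag(x, means[k], variances[k])
        for k in indices
    ])


def nearest_cluster(x, clusters):
    """Members of the cluster whose centroid is closest to x (squared Euclidean)."""
    def dist2(centroid):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, centroid))
    _, members = min(clusters, key=lambda cm: dist2(cm[0]))
    return set(members)


def select_gaussians(x, levels):
    """Intersect the selections made at every classification level."""
    selected = None
    for clusters in levels:
        chosen = nearest_cluster(x, clusters)
        selected = chosen if selected is None else selected & chosen
    return selected


# Toy one-dimensional mixture with six equally weighted components.
weights = [1.0 / 6.0] * 6
means = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
variances = [[1.0]] * 6

# Two cuts of a hypothetical cluster tree: (centroid, member indices).
levels = [
    [([1.0], {0, 1, 2}), ([11.0], {3, 4, 5})],                    # coarse cut
    [([0.5], {0, 1}), ([2.0], {2}), ([10.0], {3}), ([11.5], {4, 5})],  # fine cut
]

x = [0.4]
chosen = select_gaussians(x, levels)
approx = gmm_log_likelihood(x, weights, means, variances, selected=chosen)
full = gmm_log_likelihood(x, weights, means, variances)
print(sorted(chosen), approx, full)
```

Because the approximate score sums over a subset of the mixture, it can never exceed the full score; the gap between the two is the price paid for evaluating fewer Gaussians per frame.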
