Ad hoc language models for automatic speech recognition (original French title: Modèles de langage ad hoc pour la reconnaissance automatique de la parole)

The three pillars of an automatic speech recognition system are the lexicon, the language model and the acoustic model. The lexicon provides the set of words that can be transcribed, together with their pronunciations. The acoustic model indicates how acoustic units are realized, and the language model captures how words follow one another.

In Markovian automatic speech recognition systems, the acoustic and language models are statistical. Estimating them requires large volumes of selected, normalized and annotated data.

At present, the data available on the Web constitute by far the largest textual corpus available for French and English. These data can potentially be used to build the lexicon and to estimate and adapt the language model. The work presented here proposes new approaches for taking advantage of this resource.

This document is organized in two parts. The first deals with using data found on the Web to dynamically update the lexicon of the speech recognition engine. The proposed approach augments the engine's lexicon dynamically and locally whenever unknown words appear in the speech stream. The new words are extracted from the Web by automatically formulating queries and submitting them to a search engine. Their pronunciations are obtained with an automatic phonetizer.

The second part presents a new way of viewing the information that the Web represents, and uses elements of possibility theory to model it. A possibilistic language model is then proposed. It estimates the possibility of a word sequence from knowledge about the existence of word sequences on the Web. A Web-based probabilistic model that relies on the document counts returned by a Web search engine is also presented. Several approaches are proposed for combining these models with classical probabilistic models estimated on corpora. The results show that combining probabilistic and possibilistic models gives better results than classical probabilistic models alone. Moreover, the models estimated from Web data give better results than those estimated on corpora.
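To make the three ingredients of the second part concrete, here is a minimal Python sketch, not the formulation used in this work. It assumes that search-engine document counts are already available in a plain dictionary (the hypothetical `hits` table), that the Web probability of a word given its history is the ratio of two document counts, that the possibility of a sequence is the fraction of its n-grams with at least one hit, and that the combination is a simple linear interpolation with a hypothetical weight `lam`. All function names are illustrative.

```python
from typing import Dict, Sequence, Tuple

Ngram = Tuple[str, ...]


def web_probability(ngram: Ngram, hits: Dict[Ngram, int]) -> float:
    """Estimate P(last word | history) as a ratio of document counts.

    Assumption: hits[g] is the number of documents a search engine
    reports for the exact phrase g. Reported counts are noisy, so the
    ratio is clamped to 1.0.
    """
    history_count = hits.get(ngram[:-1], 0)
    if history_count == 0:
        return 0.0
    return min(1.0, hits.get(ngram, 0) / history_count)


def web_possibility(words: Sequence[str], hits: Dict[Ngram, int],
                    order: int = 3) -> float:
    """Illustrative possibility score: the fraction of the sequence's
    n-grams (of the given order) that exist on the Web at least once."""
    grams = [tuple(words[i:i + order])
             for i in range(len(words) - order + 1)]
    if not grams:
        return 1.0  # sequence too short to contradict anything
    seen = sum(1 for g in grams if hits.get(g, 0) > 0)
    return seen / len(grams)


def interpolate(p_corpus: float, p_web: float, lam: float = 0.5) -> float:
    """Linear interpolation of a corpus-based and a Web-based probability."""
    return lam * p_corpus + (1.0 - lam) * p_web


# Toy document counts (made up for illustration only).
hits = {
    ("the", "cat"): 200,
    ("the", "cat", "sat"): 50,
    ("cat", "sat", "on"): 10,
}

p_web = web_probability(("the", "cat", "sat"), hits)
poss = web_possibility(["the", "cat", "sat", "on", "mars"], hits)
p_mix = interpolate(0.1, p_web, lam=0.5)
```

The possibility score differs from a probability in that it only records whether a sequence has been observed, not how often, so it can penalize hypotheses containing unseen word sequences without requiring reliable frequency estimates.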
