An architecture for scalable, universal speech recognition

This thesis describes MultiSphinx, a concurrent architecture for scalable, low-latency automatic speech recognition. We first consider the problem of constructing a universal “core” speech recognizer on top of which domain- and task-specific adaptation layers can be built. We then show that when this problem is restricted to expanding the search space from a “core” vocabulary to a superset of this vocabulary across multiple passes of search, it allows us to effectively “factor” a recognizer into components of roughly equal complexity. We present simple but effective algorithms for constructing the reduced vocabulary and its associated statistical language model from an existing system. Finally, we describe the MultiSphinx decoder architecture, which allows multiple passes of recognition to operate concurrently and incrementally, either in multiple threads in the same process or across multiple processes on separate machines, and which allows the best possible partial results, including confidence scores, to be obtained at any time during the recognition process.
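
The idea of concurrent, incremental passes can be illustrated with a toy sketch. This is not the MultiSphinx decoder: the names `CORE_VOCAB`, `FULL_VOCAB`, `Hypothesis`, `first_pass`, `second_pass`, and `recognize` are hypothetical, and each "frame" is simply the spoken word itself, standing in for acoustic input. A first pass restricted to the core vocabulary and a second pass using the expanded vocabulary run in separate threads connected by a queue, and the shared hypothesis can be read at any time.

```python
import queue
import threading

# Illustrative assumption: toy vocabularies, not taken from the thesis.
CORE_VOCAB = {"call", "the", "office"}
FULL_VOCAB = CORE_VOCAB | {"pittsburgh"}
UNK = "<unk>"
DONE = object()  # marks the end of the first pass's output stream


class Hypothesis:
    """Thread-safe best-so-far word sequence, readable at any time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._words = []

    def set_word(self, index, word):
        with self._lock:
            if index == len(self._words):
                self._words.append(word)    # first pass extends the result
            else:
                self._words[index] = word   # later pass refines a position

    def snapshot(self):
        with self._lock:
            return list(self._words)


def first_pass(frames, hyp, out_q):
    """Core-vocabulary pass: words outside the core become UNK."""
    for i, frame in enumerate(frames):
        hyp.set_word(i, frame if frame in CORE_VOCAB else UNK)
        out_q.put((i, frame))  # hand the position to the next pass
    out_q.put(DONE)


def second_pass(hyp, in_q):
    """Expanded-vocabulary pass, consuming first-pass output incrementally."""
    while True:
        item = in_q.get()
        if item is DONE:
            break
        i, frame = item
        if frame in FULL_VOCAB:
            hyp.set_word(i, frame)


def recognize(frames):
    hyp = Hypothesis()
    q = queue.Queue()
    threads = [
        threading.Thread(target=first_pass, args=(frames, hyp, q)),
        threading.Thread(target=second_pass, args=(hyp, q)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return hyp.snapshot()
```

Calling `hyp.snapshot()` from another thread while both passes are running would return the best partial result available at that moment. Confidence scores and the cross-process case are omitted from this sketch.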
