Graphical models and automatic speech recognition

Graphical models provide a promising paradigm for studying both existing and novel techniques for automatic speech recognition. This paper first provides a brief overview of graphical models and their uses as statistical models. It then shows that the statistical assumptions behind many pattern recognition techniques commonly used as part of a speech recognition system can be described by a graph: these include Gaussian distributions, mixture models, decision trees, factor analysis, principal component analysis, linear discriminant analysis, and hidden Markov models. Moreover, this paper shows that many advanced models for speech recognition and language processing can also be described simply by a graph, including many at the acoustic-, pronunciation-, and language-modeling levels. A number of speech recognition techniques born directly out of the graphical-models paradigm are also surveyed. Additionally, this paper includes a novel graphical analysis of why derivative (or delta) features improve hidden Markov model-based speech recognition: they improve structural discriminability. It also includes an example in which a graph is used to represent language model smoothing constraints. As will be seen, the space of models describable by a graph is quite large. A thorough exploration of this space should yield techniques that ultimately will supersede the hidden Markov model.
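
As a minimal sketch of what it means for a model to be "described by a graph", consider the standard directed graph of a hidden Markov model: each hidden state has the previous state as its only parent, and each observation has the current state as its only parent. The joint probability is then the product of one factor per node, conditioned on that node's parents. The two-state model below, its state and symbol names, and all of its probabilities are hypothetical values chosen for illustration; they are not taken from the paper.

```python
from math import prod  # available in Python 3.8+

# Hypothetical two-state HMM; every number here is made up for illustration.
initial = {"a": 0.6, "b": 0.4}                  # p(q_1)
transition = {"a": {"a": 0.7, "b": 0.3},
              "b": {"a": 0.2, "b": 0.8}}        # p(q_t | q_{t-1})
emission = {"a": {"x": 0.9, "y": 0.1},
            "b": {"x": 0.5, "y": 0.5}}          # p(x_t | q_t)


def hmm_joint(states, observations):
    """Joint probability p(q_1..q_T, x_1..x_T) read off the HMM graph.

    Each node contributes one factor: the probability of that node
    given its parents in the graph.
    """
    if not states or len(states) != len(observations):
        raise ValueError("need equally long, non-empty sequences")
    factors = [initial[states[0]]]                    # root node q_1
    for prev, cur in zip(states, states[1:]):         # edges q_{t-1} -> q_t
        factors.append(transition[prev][cur])
    for q, x in zip(states, observations):            # edges q_t -> x_t
        factors.append(emission[q][x])
    return prod(factors)


if __name__ == "__main__":
    print(hmm_joint(["a", "b"], ["x", "y"]))
```

Changing the graph, for example by adding an edge from one observation to the next, changes only which parents appear in each factor; this is the sense in which a graph specifies a family of statistical models.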
