Conditional Random Fields for Integrating Local Discriminative Classifiers

Conditional random fields (CRFs) are a statistical framework that has recently gained in popularity in both the automatic speech recognition (ASR) and natural language processing communities because of the different nature of assumptions that are made in predicting sequences of labels compared to the more traditional hidden Markov model (HMM). In the ASR community, CRFs have been employed in a method similar to that of HMMs, using the sufficient statistics of input data to compute the probability of label sequences given acoustic input. In this paper, we explore the application of CRFs to combine local posterior estimates provided by multilayer perceptrons (MLPs) corresponding to the frame-level prediction of phone classes and phonological attribute classes. We compare phonetic recognition using CRFs to an HMM system trained on the same input features and show that the monophone label CRF is able to achieve superior performance to a monophone-based HMM and performance comparable to a 16 Gaussian mixture triphone-based HMM; in both of these cases, the CRF obtains these results with far fewer free parameters. The CRF is also able to better combine these posterior estimators, achieving a substantial increase in performance over an HMM-based triphone system by mixing the two highly correlated sets of phone class and phonetic attribute class posteriors.

[1]  J. Besag Spatial Interaction and the Statistical Analysis of Lattice Systems , 1974 .

[2]  Michael Picheny,et al.  On a model-robust training method for speech recognition , 1988, IEEE Trans. Acoust. Speech Signal Process..

[3]  Hsiao-Wuen Hon,et al.  Speaker-independent phone recognition using hidden Markov models , 1989, IEEE Trans. Acoust. Speech Signal Process..

[4]  Jonathan G. Fiscus,et al.  Darpa Timit Acoustic-Phonetic Continuous Speech Corpus CD-ROM {TIMIT} | NIST , 1993 .

[5]  Li Deng,et al.  Speech recognition using the atomic speech units constructed from overlapping articulatory features , 1994, EUROSPEECH.

[6]  Steve J. Young,et al.  MMI training for continuous phoneme recognition on the TIMIT database , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7]  Steve Young,et al.  The HTK book , 1995 .

[8]  Hervé Bourlard,et al.  Neural networks for statistical recognition of continuous speech , 1995, Proc. IEEE.

[9]  James R. Glass,et al.  Heterogeneous acoustic measurements for phonetic classification 1 , 1997, EUROSPEECH.

[10]  James R. Glass,et al.  HETEROGENEOUS ACOUSTIC MEASUREMENTS FOR PHONETIC CLASSIFICATION , 1997 .

[11]  Biing-Hwang Juang,et al.  Minimum classification error rate methods for speech recognition , 1997, IEEE Trans. Speech Audio Process..

[12]  Katrin Kirchhoff,et al.  Robust speech recognition using articulatory information , 1998 .

[13]  Mari Ostendorf,et al.  Moving beyond the 'beads-on-a-string' model of speech , 1999 .

[14]  Andrew K. Halberstadt Heterogeneous acoustic measurements and multiple classifiers for speech recognition , 1999 .

[15]  Daniel P. W. Ellis,et al.  Tandem connectionist feature extraction for conventional HMM systems , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[16]  Simon King,et al.  Detection of phonological features in continuous speech using neural networks , 2000, Comput. Speech Lang..

[17]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[18]  Daniel Povey,et al.  Minimum Phone Error and I-smoothing for improved discriminative training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[19]  Florian Metze,et al.  A flexible stream architecture for ASR using articulatory features , 2002, INTERSPEECH.

[20]  Chin-Hui Lee,et al.  Towards knowledge-based features for HMM based large vocabulary automatic speech recognition , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[21]  Li Deng,et al.  An overlapping-feature-based phonological model incorporating linguistic constraints: applications to speech recognition. , 2002, The Journal of the Acoustical Society of America.

[22]  Rob Malouf,et al.  A Comparison of Algorithms for Maximum Entropy Parameter Estimation , 2002, CoNLL.

[23]  Fernando Pereira,et al.  Shallow Parsing with Conditional Random Fields , 2003, NAACL.

[24]  James R. Glass,et al.  Hidden feature models for speech recognition using dynamic Bayesian networks , 2003, INTERSPEECH.

[25]  James R. Glass A probabilistic framework for segment-based speech recognition , 2003, Comput. Speech Lang..

[26]  Tanja Schultz,et al.  Multilingual articulatory features , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[27]  C.-H. Lee,et al.  From knowledge-ignorant to knowledge-rich modeling : a new speech research parading for next generation automatic speech recognition , 2004 .

[28]  Geoffrey Zweig,et al.  fMPE: discriminatively trained features for speech recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[29]  E. Fosler-Lussier,et al.  An evaluation of hierarchical articulatory feature detectors , 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005..

[30]  Alex Acero,et al.  Hidden conditional random fields for phone classification , 2005, INTERSPEECH.

[31]  Jeremy Morris,et al.  Discriminative Phonetic Recognition with Conditional Random Fields , 2006 .

[32]  Dong Yu,et al.  Structured speech modeling , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[33]  Lawrence K. Saul,et al.  Large Margin Hidden Markov Models for Automatic Speech Recognition , 2006, NIPS.

[34]  Eric Fosler-Lussier,et al.  Combining phonetic attributes using conditional random fields , 2006, INTERSPEECH.

[35]  Andrew McCallum,et al.  Introduction to Statistical Relational Learning , 2007 .

[36]  Jinyu Li,et al.  Detection-based ASR in the automatic speech attribute transcription project , 2007, INTERSPEECH.

[37]  Eric Fosler-Lussier,et al.  Further Experiments with Detector-Based Conditional Random Fields in Phonetic Recognition , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[38]  Simon King,et al.  An Articulatory Feature-Based Tandem Approach and Factored Observation Modeling , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[39]  Chin-Hui Lee,et al.  High-Accuracy Phone Recognition By Combining High-Performance Lattice Generation and Knowledge Based Rescoring , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[40]  Richard Rose,et al.  Integration of Multiple Feature Sets for Reducing Ambiguity in ASR , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[41]  Brian Kingsbury,et al.  Evaluation of Proposed Modifications to MPE for Large Scale Discriminative Training , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[42]  Lawrence K. Saul,et al.  Comparison of Large Margin Training to Other Discriminative Methods for Phonetic Recognition by Hidden Markov Models , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.