Discriminative training via minimization of risk estimates based on Parzen smoothing

We describe a new approach to estimating classification risk in the domain of a suitably defined transformation. The resulting estimate can serve as the basis for optimizing generic pattern recognition systems, including hidden Markov models and multi-layer perceptrons. The two formulations of the risk estimate described here are closely tied to the Minimum Classification Error/Generalized Probabilistic Descent (MCE/GPD) framework for discriminative training, which is well known to the speech recognition community. In the new approach, high-dimensional and possibly variable-length training tokens are mapped to the centers of Parzen kernels, which are then easily integrated to find the risk estimate. Such risk estimates are useful because they are explicit functions of the system parameters and hence suitable for use in practical optimization methods. The use of Parzen estimation makes it possible to establish convergence of the risk estimate to the true theoretical classification risk, a result that formally expresses the benefit of linking the degree of smoothing to the training set size. Convergence of the minimized risk estimate is also analyzed. The new approach establishes a more general theoretical foundation for discriminative training than existed before, supporting previous work and suggesting new variations for future work.
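As a rough illustration of the idea of mapping tokens to kernel centers and integrating, the following Python sketch assumes that each training token has already been reduced to a scalar misclassification measure d (positive when the token is misclassified), that a kernel of width h is centered at each d, and that the risk estimate is the kernel mass falling in the error region d > 0. The function names, the max-based measure, and the two kernel choices are assumptions made for this example and are not taken from the paper. Under these assumptions a Gaussian kernel integrates to a normal CDF term per token, and a logistic kernel integrates to the sigmoid loss familiar from MCE.

```python
import math


def misclassification_measure(scores, correct):
    """Hypothetical measure: best competing score minus correct-class score.

    A positive value means the token would be misclassified.
    """
    competitors = [s for k, s in enumerate(scores) if k != correct]
    return max(competitors) - scores[correct]


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def parzen_risk(measures, h, kernel="gaussian"):
    """Average kernel mass lying in the error region d > 0.

    Each kernel is centered at one token's measure d_i and has width h.
    Integrating it over (0, inf) gives a smooth per-token loss:
      gaussian -> Phi(d_i / h), the standard normal CDF
      logistic -> sigmoid(d_i / h)
    """
    total = 0.0
    for d in measures:
        if kernel == "gaussian":
            total += 0.5 * (1.0 + math.erf(d / (h * math.sqrt(2.0))))
        elif kernel == "logistic":
            total += _sigmoid(d / h)
        else:
            raise ValueError("unknown kernel: " + kernel)
    return total / len(measures)


def empirical_error(measures):
    """Fraction of tokens with a positive misclassification measure."""
    return sum(1 for d in measures if d > 0) / len(measures)


# Toy measures for four tokens (illustrative values, not data from the paper).
measures = [-2.0, -0.5, 0.3, 1.5]
for h in (1.0, 0.1, 0.001):
    print(h, parzen_risk(measures, h), parzen_risk(measures, h, "logistic"))
print("empirical error:", empirical_error(measures))
```

In this sketch the smoothed estimate approaches the empirical error count as h shrinks, while a larger h gives a smoother function of the measures and hence of whatever parameters produce them. This mirrors, in a simplified form, the link between the degree of smoothing and the training set size discussed above.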
