Softmax-Margin CRFs: Training Log-Linear Models with Cost Functions

We describe a method of incorporating task-specific cost functions into standard conditional log-likelihood (CLL) training of linear structured prediction models. The method, recently introduced in the speech recognition community, is described here for structured models generally; we highlight its connections to CLL and to max-margin learning for structured prediction (Taskar et al., 2003), and show that it optimizes a bound on risk. The approach is simple, efficient, and easy to implement, requiring very little change to an existing CLL implementation. We present experimental results on named-entity recognition, comparing the approach with several commonly used methods for training structured predictors.
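
As a sketch of the objective in standard notation (the feature vector f(x, y), weight vector θ, and cost function cost(y_i, y), with cost(y_i, y_i) = 0, are notational assumptions introduced here rather than taken from the abstract), softmax-margin training modifies CLL by adding the task cost inside the log-partition term:

\min_{\theta} \; \sum_{i=1}^{n} \Big( -\theta^{\top} f(x_i, y_i) \;+\; \log \!\sum_{y \in \mathcal{Y}(x_i)} \exp\big( \theta^{\top} f(x_i, y) + \mathrm{cost}(y_i, y) \big) \Big).

Setting the cost to zero recovers the usual CLL objective, which is why an existing CLL implementation needs only minor changes.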

[1] Adam L. Berger et al. A Maximum Entropy Approach to Natural Language Processing, 1996, CL.

[2] Zdravko Kacic et al. A novel loss function for the overall risk criterion based discriminative training of HMM models, 2000, INTERSPEECH.

[3] Andrew McCallum et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[4] Daniel Povey et al. Minimum Phone Error and I-smoothing for improved discriminative training, 2002, ICASSP.

[5] Yee Whye Teh et al. An Alternate Objective Function for Markovian Fields, 2002, ICML.

[6] Michael Collins et al. Ranking Algorithms for Named Entity Extraction: Boosting and the Voted Perceptron, 2002, ACL.

[7] Michael Collins et al. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, 2002, EMNLP.

[8] Ben Taskar et al. Max-Margin Markov Networks, 2003, NIPS.

[9] Erik F. Tjong Kim Sang et al. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, 2003, CoNLL.

[10] Franz Josef Och et al. Minimum Error Rate Training in Statistical Machine Translation, 2003, ACL.

[11] Koby Crammer et al. Online Passive-Aggressive Algorithms, 2003, J. Mach. Learn. Res.

[12] Philipp Koehn et al. Statistical Significance Tests for Machine Translation Evaluation, 2004, EMNLP.

[13] Martin Jansche et al. Maximum Expected F-Measure Training of Logistic Regression Models, 2005, HLT.

[14] Jun Suzuki et al. Training Conditional Random Fields with Multivariate Evaluation Measures, 2006, ACL.

[15] Nathan D. Ratliff et al. Subgradient Methods for Maximum Margin Structured Learning, 2006.

[16] Lawrence K. Saul et al. Large Margin Hidden Markov Models for Automatic Speech Recognition, 2006, NIPS.

[17] David A. Smith et al. Minimum Risk Annealing for Training Log-Linear Models, 2006, ACL.

[18] Kentaro Torisawa et al. A New Perceptron Algorithm for Sequence Labeling with Non-Local Features, 2007, EMNLP-CoNLL.

[19] Wolfgang Macherey et al. Lattice-based Minimum Error Rate Training for Statistical Machine Translation, 2008, EMNLP.

[20] Brian Kingsbury et al. Boosted MMI for model and feature-space discriminative training, 2008, ICASSP.

[21] Zhifei Li et al. First- and Second-Order Expectation Semirings with Applications to Minimum-Risk Training on Translation Forests, 2009, EMNLP.

[22] Haihua Xu et al. Minimum tag error for discriminative training of conditional random fields, 2009, Inf. Sci.

[23] Noah A. Smith et al. Softmax-Margin Training for Structured Log-Linear Models, 2010.