Computationally Efficient M-Estimation of Log-Linear Structure Models

We describe a new loss function, due to Jeon and Lin (2006), for estimating structured log-linear models on arbitrary features. The loss function can be seen as a (generative) alternative to maximum likelihood estimation with an interesting information-theoretic interpretation, and it is statistically consistent. Training with it is substantially faster, by an order of magnitude or more, than maximum (conditional) likelihood estimation of conditional random fields (Lafferty et al., 2001). We compare its performance and training time to those of an HMM, a CRF, an MEMM, and pseudolikelihood on a shallow parsing task. These experiments help tease apart the contributions of rich features and discriminative training, which are shown to be more than additive.
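
As a rough illustration only, the sketch below assumes the loss takes the form -(1/n) sum_i theta.f(x_i) + sum_x q0(x) exp(theta.f(x)), where q0 is a fixed base distribution and the model is q0(x) exp(theta.f(x)). This form is an assumption about the Jeon and Lin (2006) estimator and is not stated in the abstract. The toy three-outcome domain, the one-hot features, the plain gradient descent, and all function names are hypothetical choices made for the example. Under this assumed form, the second term does not depend on the training data, so the data enter only through the empirical feature averages.

```python
import math

def dot(theta, fx):
    """Inner product theta . f(x)."""
    return sum(t * f for t, f in zip(theta, fx))

def m_loss(theta, data, domain, q0, feats):
    """Assumed loss: -(1/n) sum_i theta.f(x_i) + sum_x q0(x) exp(theta.f(x))."""
    fit_term = -sum(dot(theta, feats(x)) for x in data) / len(data)
    mass_term = sum(q0[x] * math.exp(dot(theta, feats(x))) for x in domain)
    return fit_term + mass_term

def m_grad(theta, data, domain, q0, feats):
    """Gradient: -(empirical feature mean) + sum_x q0(x) exp(theta.f(x)) f(x)."""
    g = [0.0] * len(theta)
    for x in data:
        for j, f in enumerate(feats(x)):
            g[j] -= f / len(data)
    for x in domain:
        w = q0[x] * math.exp(dot(theta, feats(x)))
        for j, f in enumerate(feats(x)):
            g[j] += w * f
    return g

def fit(data, domain, q0, feats, dim, lr=0.5, steps=2000):
    """Plain gradient descent on the (convex) assumed loss."""
    theta = [0.0] * dim
    for _ in range(steps):
        g = m_grad(theta, data, domain, q0, feats)
        theta = [t - lr * gj for t, gj in zip(theta, g)]
    return theta

# Toy problem (hypothetical): three outcomes, uniform base distribution,
# one indicator feature per outcome.
domain = [0, 1, 2]
q0 = {x: 1.0 / 3.0 for x in domain}
feats = lambda x: [1.0 if x == j else 0.0 for j in domain]
data = [0, 0, 0, 1, 1, 2]

theta = fit(data, domain, q0, feats, dim=3)
model = [q0[x] * math.exp(dot(theta, feats(x))) for x in domain]
```

With indicator features the minimizer has the closed form theta_j = log(p_j / q0_j), where p_j is the empirical frequency of outcome j, so `model` should match the empirical frequencies of `data` up to optimization error.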

[1] Noah A. Smith et al. Compiling Comp Ling: Weighted Dynamic Programming and the Dyna Language, 2005, HLT.

[2] Michael I. Jordan et al. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes, 2001, NIPS.

[3] Taylor L. Booth et al. Applying Probability Measures to Abstract Languages, 1973, IEEE Transactions on Computers.

[4] D. Rubin et al. Maximum likelihood from incomplete data via the EM algorithm, plus discussions on the paper, 1977.

[5] Giorgio Satta et al. Cross-Entropy and Estimation of Probabilistic Context-Free Grammars, 2006, NAACL.

[6] Zhiyi Chi et al. Estimation of Probabilistic Context-Free Grammars, 1998, Comput. Linguistics.

[7] Ronald Rosenfeld et al. A survey of smoothing techniques for ME models, 2000, IEEE Trans. Speech Audio Process.

[8] Andreas Stolcke et al. An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities, 1994, CL.

[9] Fernando Pereira et al. Shallow Parsing with Conditional Random Fields, 2003, NAACL.

[10] Adwait Ratnaparkhi et al. A maximum entropy model for parsing, 1994, ICSLP.

[11] Roni Rosenfeld et al. A whole sentence maximum entropy language model, 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[12] Andrew McCallum et al. Maximum Entropy Markov Models for Information Extraction and Segmentation, 2000, ICML.

[13] Dan Klein et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network, 2003, NAACL.

[14] Sabine Buchholz et al. Introduction to the CoNLL-2000 Shared Task Chunking, 2000, CoNLL/LLL.

[15] Mark Johnson et al. Joint and Conditional Estimation of Tagging and Parsing Models, 2001, ACL.

[16] Andrew McCallum et al. Piecewise Training for Undirected Models, 2005, UAI.

[17] J. O’Sullivan. Alternating Minimization Algorithms: From Blahut-Arimoto to Expectation-Maximization, 1998.

[18] Andrew McCallum et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[19] Jorge Nocedal et al. On the limited memory BFGS method for large scale optimization, 1989, Math. Program.

[20] Yi Lin et al. An Effective Method for High-Dimensional Log-Density ANOVA Estimation, with Application to Nonparametric Graphical Model Building, 2006.

[21] Dan Klein et al. Conditional Structure versus Conditional Estimation in NLP Models, 2002, EMNLP.

[22] J. Besag. Statistical Analysis of Non-Lattice Data, 1975.