Paper Information - Computationally Efficient M-Estimation of Log-Linear Structured Models

Computationally Efficient M-Estimation of Log-Linear Structured Models

We describe a new loss function, due to Jeon and Lin (2006), for estimating structured log-linear models on arbitrary features. The loss function can be seen as a (generative) alternative to maximum likelihood estimation with an interesting information-theoretic interpretation, and it is statistically consistent. It is substantially faster than maximum (conditional) likelihood estimation of conditional random fields (Lafferty et al., 2001; an order of magnitude or more). We compare its performance and training time to an HMM, a CRF, an MEMM, and pseudolikelihood on a shallow parsing task. These experiments help tease apart the contributions of rich features and discriminative training, which are shown to be more than additive.

Noah A. Smith | John D. Lafferty | Douglas L. Vail

[1] Noah A. Smith, et al. Compiling Comp Ling: Weighted Dynamic Programming and the Dyna Language, 2005, HLT.

[2] Michael I. Jordan, et al. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes, 2001, NIPS.

[3] Taylor L. Booth, et al. Applying Probability Measures to Abstract Languages, 1973, IEEE Transactions on Computers.

[4] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[5] Giorgio Satta, et al. Cross-Entropy and Estimation of Probabilistic Context-Free Grammars, 2006, NAACL.

[6] Zhiyi Chi, et al. Estimation of Probabilistic Context-Free Grammars, 1998, Comput. Linguistics.

[7] Ronald Rosenfeld, et al. A survey of smoothing techniques for ME models, 2000, IEEE Trans. Speech Audio Process.

[8] Andreas Stolcke, et al. An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities, 1994, CL.

[9] Fernando Pereira, et al. Shallow Parsing with Conditional Random Fields, 2003, NAACL.

[10] Adwait Ratnaparkhi, et al. A maximum entropy model for parsing, 1994, ICSLP.

[11] Roni Rosenfeld, et al. A whole sentence maximum entropy language model, 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[12] Andrew McCallum, et al. Maximum Entropy Markov Models for Information Extraction and Segmentation, 2000, ICML.

[13] Dan Klein, et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network, 2003, NAACL.

[14] Sabine Buchholz, et al. Introduction to the CoNLL-2000 Shared Task Chunking, 2000, CoNLL/LLL.

[15] Mark Johnson, et al. Joint and Conditional Estimation of Tagging and Parsing Models, 2001, ACL.

[16] Andrew McCallum, et al. Piecewise Training for Undirected Models, 2005, UAI.

[17] J. O’Sullivan. Alternating Minimization Algorithms: From Blahut-Arimoto to Expectation-Maximization, 1998.

[18] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[19] Jorge Nocedal, et al. On the limited memory BFGS method for large scale optimization, 1989, Math. Program.

[20] Yi Lin, et al. An Effective Method for High-Dimensional Log-Density ANOVA Estimation, with Application to Nonparametric Graphical Model Building, 2006.

[21] Dan Klein, et al. Conditional Structure versus Conditional Estimation in NLP Models, 2002, EMNLP.

[22] J. Besag. Statistical Analysis of Non-Lattice Data, 1975.