Inference With Classifiers: A Study of Structured Output Problems in Natural Language Processing

A large number of problems in natural language processing (NLP) involve outputs with complex structure. Conceptually, the task in such problems is to assign values to multiple variables that represent the outputs of several interdependent components. A natural approach is to formulate the task as a two-stage process. In the first stage, the variables are assigned initial values by machine-learning-based classifiers. In the second, an inference procedure combines the outcomes of the first-stage classifiers with domain-specific constraints to infer a globally consistent final prediction. This dissertation introduces a framework, inference with classifiers, to study such problems. The framework is applied to two important and fundamental NLP problems that involve complex structured outputs: shallow parsing and semantic role labeling. In shallow parsing, the goal is to identify syntactic phrases in sentences, a task that has been found useful in a variety of large-scale NLP applications. Semantic role labeling is the task of identifying predicate-argument structure in sentences, a crucial step toward a deeper understanding of natural language. For both tasks, we develop state-of-the-art systems that have been used in practice. Within this framework, we show the significance of incorporating constraints into the inference stage as a way to correct and improve the decisions of the stand-alone classifiers. Although incorporating constraints into inference necessarily improves global coherence, there is no guarantee that it improves performance measured in terms of the accuracy of the local predictions, which is the metric of interest for most applications. We develop a better theoretical understanding of this issue. Under a reasonable assumption, we prove a sufficient condition that guarantees that using constraints cannot degrade performance with respect to Hamming loss. In addition, we provide an experimental study suggesting that constraints can improve performance even when the sufficient conditions are not fully satisfied.
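
The two-stage process described above can be illustrated with a minimal sketch. It is not the dissertation's system: the BIO chunk labels, the score values, the `is_valid` constraint, and the brute-force search are all assumptions chosen to keep the example small. Stage one is stood in for by a table of per-token classifier scores; stage two picks the highest-scoring label sequence that satisfies the constraint.

```python
from itertools import product

LABELS = ("B", "I", "O")  # begin chunk, inside chunk, outside any chunk

def is_valid(seq):
    """Hypothetical structural constraint: I may not start a sequence or follow O."""
    prev = "O"
    for tag in seq:
        if tag == "I" and prev == "O":
            return False
        prev = tag
    return True

def local_predict(scores):
    """Stage 1 alone: each token takes its own highest-scoring label."""
    return tuple(max(LABELS, key=lambda y: s[y]) for s in scores)

def constrained_predict(scores):
    """Stage 2: best-scoring sequence among those satisfying the constraint.

    Exhaustive enumeration is used only for clarity; it is exponential
    in the sentence length and would not scale to real inputs.
    """
    best, best_score = None, float("-inf")
    for seq in product(LABELS, repeat=len(scores)):
        if not is_valid(seq):
            continue
        total = sum(s[y] for s, y in zip(scores, seq))
        if total > best_score:
            best, best_score = seq, total
    return best

# Made-up classifier scores for a four-token sentence (illustrative only).
scores = [
    {"B": 0.60, "I": 0.10, "O": 0.30},
    {"B": 0.10, "I": 0.70, "O": 0.20},
    {"B": 0.10, "I": 0.20, "O": 0.70},
    {"B": 0.35, "I": 0.40, "O": 0.25},
]

print(local_predict(scores))        # independent decisions, may be inconsistent
print(constrained_predict(scores))  # globally consistent decision
```

With these scores, the independent decisions place an I directly after an O, which violates the constraint, while the constrained search returns a consistent sequence by changing only the least confident decision.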
