Optimizing for Measure of Performance in Max-Margin Parsing

Many learning tasks in natural language processing, including sequence tagging, sequence segmentation, and syntactic parsing, have been successfully approached by means of structured prediction methods. An appealing property of the corresponding training algorithms is their ability to integrate the loss function of interest into the optimization process, improving the final results according to the chosen measure of performance. Here, we focus on the task of constituency parsing and show how to optimize the model for the $F_1$-score in the max-margin framework of a structural support vector machine (SVM). For reasons of computational efficiency, it is common to binarize the corresponding grammar before training. Unfortunately, this introduces a bias into the training procedure: the loss function is evaluated on the binarized representation, while the resulting performance is measured on the original unbinarized trees. We address this problem by extending the inference procedure presented by Bauer et al. [23]. Specifically, we propose an algorithmic modification that allows the loss to be evaluated on the unbinarized trees. The new approach properly models the loss function of interest, resulting in better prediction accuracy, while still benefiting from the computational efficiency of the binarized representation. The presented idea transfers readily to other structured loss functions.
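To make the central issue concrete, the sketch below (illustrative only, not the authors' implementation) shows how the bracket loss $\Delta(y, \hat{y}) = 1 - F_1$ can be evaluated on unbinarized trees even when the parser decodes over a binarized grammar: the intermediate nodes introduced by binarization, marked here with an assumed "@" label prefix, are simply skipped when collecting labeled spans. The tree encoding and all function names are hypothetical.

```python
from collections import Counter

# Minimal sketch: computing Delta(y, y_hat) = 1 - F1 over the labeled
# brackets of the UNBINARIZED trees. Assumed conventions: trees are
# (label, children) tuples with string leaves, and auxiliary nodes
# introduced by binarization carry an "@" label prefix.

def labeled_spans(tree, start=0):
    """Return (spans, end): labeled spans (label, i, j) of `tree`,
    skipping the auxiliary "@" nodes so the resulting bracketing
    matches the original unbinarized tree."""
    label, children = tree
    i = start
    result = []
    for child in children:
        if isinstance(child, str):          # leaf token
            i += 1
        else:
            child_spans, i = labeled_spans(child, i)
            result.extend(child_spans)
    if not label.startswith("@"):           # drop binarization artifacts
        result.append((label, start, i))
    return result, i

def f1_loss(gold_tree, pred_tree):
    """1 - F1 over labeled brackets of the unbinarized trees.
    (A full evalb-style score would also ignore preterminal and
    punctuation brackets.)"""
    gold = Counter(labeled_spans(gold_tree)[0])
    pred = Counter(labeled_spans(pred_tree)[0])
    matched = sum((gold & pred).values())
    if matched == 0:
        return 1.0
    p = matched / sum(pred.values())        # precision
    r = matched / sum(gold.values())        # recall
    return 1.0 - 2 * p * r / (p + r)

# Example: a ternary rule S -> NP VP . is binarized via an auxiliary
# "@S" node; the extra (2,4) bracket it spans must not be penalized.
gold = ("S", [("NP", ["the", "cat"]), ("VP", ["sleeps"]), (".", ["."])])
pred = ("S", [("NP", ["the", "cat"]), ("@S", [("VP", ["sleeps"]), (".", ["."])])])
print(f1_loss(gold, pred))                  # -> 0.0
```

Note that during max-margin training this $\Delta$ appears inside the loss-augmented inference problem, and since the $F_1$-score does not decompose over individual brackets, solving that problem exactly requires the kind of extended dynamic-programming procedure the paper builds on; the sketch above covers only the evaluation of $\Delta$ itself.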

[1] Brendan J. Frey et al. Fast Exact Inference for Recursive Cardinality Models. UAI, 2012.

[2] Eugene Charniak et al. Parsing as Language Modeling. EMNLP, 2016.

[3] David J. Spiegelhalter et al. Sequential updating of conditional probabilities on directed graphical structures. Networks, 1990.

[4] Beatrice Santorini et al. Building a Large Annotated Corpus of English: The Penn Treebank. CL, 1993.

[5] Klaus-Robert Müller et al. Efficient Algorithms for Exact Inference in Sequence Labeling SVMs. IEEE Transactions on Neural Networks and Learning Systems, 2014.

[6] Thomas Hofmann et al. Predicting structured objects with support vector machines. Commun. ACM, 2009.

[7] Koby Crammer et al. Online Large-Margin Training of Dependency Parsers. ACL, 2005.

[8] Richard S. Zemel et al. HOP-MAP: Efficient Message Passing with High Order Potentials. AISTATS, 2010.

[9] Richard S. Zemel et al. Structured Output Learning with High Order Loss Functions. AISTATS, 2012.

[10] Markus Neuhäuser et al. Wilcoxon Signed Rank Test. 2006.

[11] Shinichi Nakajima et al. Efficient Exact Inference With Loss Augmented Objective in Structured Learning. IEEE Transactions on Neural Networks and Learning Systems, 2017.

[12] Ben Taskar et al. Max-Margin Parsing. EMNLP, 2004.

[13] Ben Taskar et al. Learning structured prediction models: a large margin approach. ICML, 2005.

[14] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). 2006.

[15] J. E. Kelley. The Cutting-Plane Method for Solving Convex Programs. 1960.

[16] Yang Wang et al. Optimizing Complex Loss Functions in Structured Prediction. ECCV, 2010.

[17] G. D. Forney, Jr. The Viterbi Algorithm. Proceedings of the IEEE, 1973.

[18] Mark W. Schmidt et al. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs. ICML, 2012.

[19] Daniel H. Younger. Recognition and Parsing of Context-Free Languages in Time $n^3$. Inf. Control., 1967.

[20] Thorsten Joachims et al. Training structural SVMs when exact inference is intractable. ICML, 2008.

[21] Thomas Hofmann et al. Large Margin Methods for Structured and Interdependent Output Variables. J. Mach. Learn. Res., 2005.

[22] Dan Klein et al. Constituency Parsing with a Self-Attentive Encoder. ACL, 2018.

[23] Klaus-Robert Müller et al. Accurate Maximum-Margin Training for Parsing With Context-Free Grammars. IEEE Transactions on Neural Networks and Learning Systems, 2017.

[24] David J. Spiegelhalter et al. Local computations with probabilities on graphical structures and their application to expert systems. 1990.

[25] Adam Lopez et al. Training a Log-Linear Parser with Loss Functions via Softmax-Margin. EMNLP, 2011.

[26] Dan Klein et al. Improving Neural Parsing by Disentangling Model Combination and Reranking Effects. ACL, 2017.

[27] Ben Taskar et al. Max-Margin Markov Networks. NIPS, 2003.

[28] Thorsten Joachims et al. Cutting-plane training of structural SVMs. Machine Learning, 2009.

[29] Mark Johnson. PCFG Models of Linguistic Tree Representations. CL, 1998.

[30] Alexander J. Smola et al. Bundle Methods for Machine Learning. NIPS, 2007.

[31] Thorsten Joachims. A support vector method for multivariate performance measures. ICML, 2005.

[32] Alexander J. Smola et al. Bundle Methods for Regularized Risk Minimization. J. Mach. Learn. Res., 2010.

[33] Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP, 2002.

[34] Dan Klein et al. Accurate Unlexicalized Parsing. ACL, 2003.

[35] Nir Friedman et al. Probabilistic Graphical Models: Principles and Techniques (Adaptive Computation and Machine Learning). 2009.