Feature Selection in Kernel Space: A Case Study on Dependency Parsing

Given a set of basic binary features, we propose a new feature selection method based on an L1-norm SVM that explicitly selects features in the polynomial or tree kernel space induced by those basic features. Its efficiency comes from an anti-monotone property of the subgradients: the subgradient with respect to a combined feature is bounded by the subgradient with respect to each of its component features, so a feature whose subgradient is not steep enough can be pruned safely, along with every combined feature that contains it, without further consideration. We conduct experiments on English dependency parsing with a third-order graph-based parser. Benefiting from the rich features selected in the tree kernel space, our model achieves the best reported unlabeled attachment score of 93.72 without using any additional resources.
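
A minimal sketch of the pruning idea, assuming binary features and an L1-regularized linear objective; it is not the authors' implementation. The helper names (`residuals`, `lam`, `select_pairs`) are illustrative assumptions, and the bound uses a common positive/negative split of the per-feature gradient: extending a conjunction can only remove the examples it fires on, so both halves of the gradient shrink monotonically and `max(pos, neg)` bounds the gradient of every extension.

```python
from itertools import combinations

def grad_and_bound(values, residuals):
    """Gradient of the loss w.r.t. one binary feature, plus an upper
    bound that holds for every conjunction extending that feature.

    Extending a conjunction can only drop examples, so the positive and
    negative gradient masses only shrink; hence max(pos, neg) bounds
    |gradient| for all extensions (the anti-monotone property).
    """
    pos = sum(r for v, r in zip(values, residuals) if v and r > 0)
    neg = -sum(r for v, r in zip(values, residuals) if v and r < 0)
    return abs(pos - neg), max(pos, neg)

def select_pairs(base_features, residuals, lam):
    """Apriori-style search over pairwise conjunctions.

    A base feature is expanded only if its bound exceeds lam (a safe
    prune: no extension can become active); a conjunction is kept only
    if its own gradient exceeds lam.
    """
    expandable = {}
    for name, vals in base_features.items():
        _, bound = grad_and_bound(vals, residuals)
        if bound > lam:
            expandable[name] = vals
    selected = []
    for (n1, v1), (n2, v2) in combinations(expandable.items(), 2):
        conj = [a and b for a, b in zip(v1, v2)]
        grad, _ = grad_and_bound(conj, residuals)
        if grad > lam:
            selected.append(n1 + "&" + n2)
    return selected

# Toy usage with made-up per-example loss derivatives:
feats = {"f1": [1, 1, 0, 1], "f2": [1, 0, 0, 1], "f3": [0, 0, 1, 0]}
res = [0.9, -0.2, 0.8, 0.5]
print(select_pairs(feats, res, lam=1.0))  # -> ['f1&f2']; f3 is pruned early
```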
