Linear Online Learning over Structured Data with Distributed Tree Kernels

Online algorithms are an important class of learning machines because they are extremely simple and computationally efficient. Their kernelized versions can handle structured data, such as trees, and achieve state-of-the-art performance. However, kernelized online learning algorithms slow down as the number of support vectors grows. The traditional way to cope with this problem is to introduce a budget that caps the number of support vectors. In this paper, we investigate Distributed Trees (DTs) as an efficient way to use structured data in online learning. DTs effectively embed the huge feature space of tree fragments into small vectors, thus enabling the use of linear versions of kernel machines over tree-structured data. We experiment with the Passive-Aggressive (PA) algorithm, comparing its linear and kernelized versions. We employ a massive dataset of tree-structured data that originates from a natural language processing task: Boundary Detection in the context of Semantic Role Labeling over FrameNet. Results on a sample of the full data show that DTs with the linear PA algorithm and Tree Kernels with the budgeted PA algorithm achieve comparable results in terms of F1-measure. Finally, using the full dataset allows the former to outperform the latter on the classification task.
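To make the linear setting concrete, the following is a minimal sketch of a binary Passive-Aggressive learner over dense vectors. It assumes each tree has already been mapped to a fixed-size vector (the role DTs play here); the PA-I step-size rule, the aggressiveness parameter `C`, and the function names `pa_update` and `train_linear_pa` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def pa_update(w, x, y, C=1.0):
    """One PA-I step (assumed variant) for a label y in {-1, +1}."""
    # Hinge loss of the current model on this example.
    loss = max(0.0, 1.0 - y * float(np.dot(w, x)))
    if loss == 0.0:
        return w  # passive: margin already satisfied, no change
    sq_norm = float(np.dot(x, x))
    if sq_norm == 0.0:
        return w  # a zero vector carries no information to update on
    # Aggressive: smallest step that fixes the margin, capped by C.
    tau = min(C, loss / sq_norm)
    return w + tau * y * x


def train_linear_pa(X, Y, C=1.0, epochs=1):
    """Run PA over the rows of X (hypothetical stand-ins for DT vectors)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            w = pa_update(w, x, y, C)
    return w


def predict(w, X):
    """Sign of the linear score; ties at zero are mapped to +1."""
    return np.where(X @ w >= 0.0, 1, -1)
```

The model is a single weight vector whose size is fixed by the embedding dimension, so the cost of each update does not depend on how many examples have been seen. A kernelized learner, by contrast, must evaluate the kernel against every stored support vector, which is why it needs a budget.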
