Deep Learning in Semantic Kernel Spaces

Kernel methods enable the direct use of structured representations of textual data in language learning and inference tasks. Expressive kernels, such as Tree Kernels, achieve excellent performance in NLP. On the other hand, deep neural networks have proven effective at automatically learning feature representations during training. However, their input is tensor data, i.e., they cannot manage rich structured information. In this paper, we show that expressive kernels and deep neural networks can be combined in a common framework in order to (i) explicitly model structured information and (ii) learn non-linear decision functions. We show that the input layer of a deep architecture can be pre-trained through the application of the Nyström low-rank approximation of kernel spaces. The resulting “kernelized” neural network achieves state-of-the-art accuracy in three different tasks.
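As an illustration of the Nyström step mentioned above, the following is a minimal sketch and not the implementation used in the paper. It assumes an RBF kernel over dense vectors as a stand-in for a tree kernel over parse trees, and the names rbf_kernel, nystrom_projection and embed are hypothetical. Given a set of landmark examples, the Gram matrix W over the landmarks is decomposed as W = U S U^T, and each input x is mapped to k(x, landmarks) U S^{-1/2}, so that dot products between the mapped vectors approximate the original kernel.

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    # Stand-in kernel (assumption): the paper works with tree kernels over
    # parse trees; any positive semi-definite kernel can be plugged in here.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def nystrom_projection(landmarks, kernel, eps=1e-10):
    # Gram matrix over the landmarks, decomposed as W = U diag(s) U^T.
    w = kernel(landmarks, landmarks)
    s, u = np.linalg.eigh(w)
    keep = s > eps  # drop numerically null directions
    # Projection matrix H = U S^{-1/2}, one column per retained eigenvector.
    return u[:, keep] / np.sqrt(s[keep])


def embed(x, landmarks, kernel, h):
    # Kernel evaluations against the landmarks, followed by the linear map H.
    return kernel(x, landmarks) @ h


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))
landmarks = data[rng.choice(len(data), size=20, replace=False)]

h = nystrom_projection(landmarks, rbf_kernel)
z = embed(data, landmarks, rbf_kernel, h)

# Dot products in the embedded space approximate the kernel matrix.
approx = z @ z.T
exact = rbf_kernel(data, data)
print("mean absolute approximation error:", np.abs(approx - exact).mean())
```

In this sketch the matrix h acts as a fixed linear layer applied to the vector of kernel evaluations, which is one way to read the pre-trained input layer described in the abstract; further non-linear layers would then be stacked on z and trained as usual. The approximation is exact on the landmarks themselves and generally improves as more landmarks are used.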
