Fast Kronecker Product Kernel Methods via Generalized Vec Trick

The Kronecker product kernel is the standard approach in the kernel methods literature for learning from graph data, where edges are labeled and both start and end vertices have their own feature representations. Such methods generalize to new edges whose start and end vertices do not appear in the training data, a setting known as zero-shot or zero-data learning. This setting arises in numerous applications, including drug-target interaction prediction, collaborative filtering, and information retrieval. Efficient training algorithms based on the so-called vec trick, which exploits the special structure of the Kronecker product, are known for the case where the training data form a complete bipartite graph. In this paper, we generalize these results to noncomplete training graphs. This allows us to derive a general framework for training Kronecker product kernel methods; as specific examples, we implement Kronecker ridge regression and support vector machine algorithms. Experimental results demonstrate that the proposed approach yields accurate models while providing order-of-magnitude improvements in training and prediction time.
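
To make the vec trick concrete, below is a minimal NumPy sketch of both the complete-graph identity and its noncomplete generalization. The function names, the column-stacking vec convention, and the scatter-multiply-gather formulation of the sparse case are illustrative choices made for this sketch, not the paper's implementation; the second function is mathematically equivalent to, but asymptotically slower than, the optimized generalized vec trick derived in the paper.

```python
import numpy as np

def kron_matvec(M, N, v):
    """Classical vec trick: (M kron N) vec(V) = vec(N V M^T).

    M is a x b, N is c x d, and v has length b*d (the column-stacked
    vec of a d x b matrix V). The Kronecker product, which would take
    O(abcd) memory to store, is never formed explicitly.
    """
    b = M.shape[1]
    d = N.shape[1]
    V = v.reshape((d, b), order="F")            # invert column-stacking vec
    return (N @ V @ M.T).reshape(-1, order="F")

def gvt_matvec(M, N, out_idx, in_idx, v):
    """Noncomplete-graph analogue: compute R (M kron N) C^T v.

    in_idx = (i, j) index arrays naming the labeled input edges, so
    C^T scatters v onto those positions; out_idx = (i, j) index arrays
    naming the labeled output edges, so R gathers the result. This
    scatter-multiply-gather version gives the same result as the
    paper's generalized vec trick, but without its complexity savings.
    """
    b = M.shape[1]
    d = N.shape[1]
    V = np.zeros((d, b))
    np.add.at(V, in_idx, v)                     # scatter: V = vec^{-1}(C^T v)
    T = N @ V @ M.T                             # dense vec-trick product
    return T[out_idx]                           # gather the labeled entries
```

For small matrices, both functions can be checked directly against the explicit product: kron_matvec(M, N, v) should equal np.kron(M, N) @ v, and gvt_matvec should equal the rows and columns of that product selected by the given index arrays.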
