Fast Kronecker product kernel methods via sampled vec trick

The Kronecker product kernel provides the standard approach in the kernel methods literature for learning from pair-input data, where both data points and prediction tasks have their own feature representations. These methods allow simultaneous generalization to both new tasks and new data points unobserved in the training set, a setting known as zero-shot or zero-data learning. Such a setting occurs in numerous applications, including drug-target interaction prediction, collaborative filtering, and information retrieval. Efficient training algorithms based on the so-called vec trick, which exploits the special structure of the Kronecker product, are known for the case where the output matrix for the training set is fully observed, i.e., the correct output is available for every data point-task combination. In this work we generalize these results by proposing an efficient algorithm for sampled Kronecker product multiplication, where only a subset of the full Kronecker product is computed. This allows us to derive a general framework for training Kronecker product kernel methods; as specific examples, we implement Kronecker ridge regression and support vector machine algorithms. Experimental results demonstrate that the proposed approach yields accurate models while delivering order-of-magnitude improvements in training and prediction time.
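To make the idea concrete, below is a minimal sketch of sampled Kronecker product multiplication in Python/NumPy. It is an illustration written for this summary, not the authors' reference implementation; the function name, signature, and index conventions are assumptions. Given a task kernel matrix M and a data kernel matrix N, it computes u = R (M ⊗ N) Cᵀ v, where R and C select the sampled row and column pairs, without ever forming the full Kronecker product.

```python
import numpy as np

def sampled_vec_trick(v, M, N, row_tasks, row_data, col_tasks, col_data):
    """Compute u = R (M kron N) C^T v without forming M kron N.

    M is a t x t task kernel, N is a d x d data kernel. Row i of the
    sampled product corresponds to the pair (row_data[i], row_tasks[i])
    and column j to (col_data[j], col_tasks[j]), so that
        u[i] = sum_j N[row_data[i], col_data[j]]
                     * M[row_tasks[i], col_tasks[j]] * v[j].
    """
    d, t = N.shape[0], M.shape[0]
    # Scatter step, O(len(v) * d): column q of W accumulates
    # v[j] * N[:, col_data[j]] over all sampled columns j with task q.
    W = np.zeros((d, t))
    for j in range(len(v)):
        W[:, col_tasks[j]] += v[j] * N[:, col_data[j]]
    # Gather step, O(len(u) * t): each output entry is an inner
    # product over the task dimension.
    u = np.empty(len(row_data))
    for i in range(len(u)):
        u[i] = W[row_data[i]] @ M[row_tasks[i]]
    return u

# Sanity check against the explicit Kronecker product on a tiny instance,
# using the same index pairs for rows and columns (a square subproblem).
rng = np.random.default_rng(0)
d, t, n = 5, 3, 7
N = rng.standard_normal((d, d)); N = N @ N.T   # p.s.d. data kernel
M = rng.standard_normal((t, t)); M = M @ M.T   # p.s.d. task kernel
data_idx = rng.integers(0, d, size=n)
task_idx = rng.integers(0, t, size=n)
v = rng.standard_normal(n)

u = sampled_vec_trick(v, M, N, task_idx, data_idx, task_idx, data_idx)
flat = task_idx * d + data_idx                 # pair index into np.kron(M, N)
K = np.kron(M, N)[np.ix_(flat, flat)]
assert np.allclose(u, K @ v)
```

The scatter step costs O(|columns| · d) and the gather step O(|rows| · t), so the multiplication never touches all |rows| · |columns| entries of the sampled kernel matrix; swapping the roles of M and N gives the symmetric cost, and in practice one would pick whichever variant is cheaper.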
