Hyperparameter Learning via Distributional Transfer

Bayesian optimisation is a popular technique for hyperparameter learning, but it typically requires initial exploration even when similar prior tasks have already been solved. We propose to transfer information across tasks using learnt representations of the training datasets used in those tasks. This results in a joint Gaussian process model over hyperparameters and data representations. The representations build on the framework of embedding distributions into reproducing kernel Hilbert spaces. The proposed method converges faster than existing baselines, in some cases requiring only a few evaluations of the target objective.
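To make the idea concrete, below is a minimal, illustrative sketch (not the paper's actual model) of the two ingredients the abstract describes: each dataset is summarised by an empirical kernel mean embedding (here approximated with shared random Fourier features), and a single Gaussian process is placed over (hyperparameter, dataset-embedding) pairs via a product kernel, so that evaluations on source datasets inform predictions on a new target dataset. All names, kernel choices, and numbers are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared random Fourier features approximating an RBF kernel; every dataset
# must use the SAME feature map so its mean embedding lives in one space.
d, n_features, ls_data = 3, 100, 1.0
W = rng.normal(scale=1.0 / ls_data, size=(d, n_features))
b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

def dataset_embedding(X):
    """Empirical kernel mean embedding: average the feature map over the dataset."""
    phi = np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
    return phi.mean(axis=0)

def rbf(A, B, lengthscale):
    """Pairwise RBF kernel between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def joint_kernel(T1, M1, T2, M2, ls_theta=1.0, ls_mu=1.0):
    """Product kernel over (hyperparameter, dataset-embedding) pairs:
    one GP couples all tasks through their data representations."""
    return rbf(T1, T2, ls_theta) * rbf(M1, M2, ls_mu)

# Toy usage: two source datasets with a few evaluated hyperparameters each.
X_a = rng.normal(size=(50, d))
X_b = rng.normal(size=(80, d)) + 1.0
mu_a, mu_b = dataset_embedding(X_a), dataset_embedding(X_b)

Theta = np.array([[0.1], [1.0], [0.1], [1.0]])   # evaluated hyperparameter values
Mu = np.stack([mu_a, mu_a, mu_b, mu_b])          # matching dataset embeddings
y = np.array([0.80, 0.60, 0.70, 0.90])           # observed validation scores

# GP posterior mean for a new hyperparameter on an unseen target dataset.
mu_t = dataset_embedding(rng.normal(size=(60, d)) + 0.5)
theta_new = np.array([[0.5]])

K = joint_kernel(Theta, Mu, Theta, Mu) + 1e-6 * np.eye(len(y))
k_star = joint_kernel(theta_new, mu_t[None, :], Theta, Mu)
print("predicted score on target dataset:", k_star @ np.linalg.solve(K, y))
```

In a full method, the predictive posterior of such a joint GP would drive an acquisition function on the target task, so that source-task evaluations reduce the initial exploration the abstract refers to.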
