Active Learning of Hyperparameters: An Expected Cross Entropy Criterion for Active Model Selection

In standard active learning, the learner's goal is to reduce the predictive uncertainty with as little data as possible. We consider a slightly different problem: the learner's goal is to uncover latent properties of the model, e.g., which features are relevant ("active feature selection") or the choice of hyperparameters, with as little data as possible. While the two goals are clearly related, we give examples where following the predictive uncertainty objective is suboptimal for uncovering latent parameters. We propose novel measures of information gain about the latent parameter, based on the divergence between the prior and expected posterior distribution over the latent parameter in question. Notably, this is different from applying Bayesian experimental design to latent variables: we give explicit examples showing that the latter objective is prone to getting stuck in local minima, unlike its application to the standard predictive uncertainty objective. Extensive evaluations show that active learning using our measures significantly accelerates the uncovering of latent model parameters, compared to standard version space approaches (query-by-committee) as well as predictive uncertainty measures.
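To make the criterion concrete, the sketch below gives one plausible reading of the abstract for the discrete case: a finite set of candidate hyperparameter settings ("models"), a current distribution over them, and a score for each candidate query given by the divergence between that distribution and the posterior expected after the query. The model interface (log_marginal_likelihood, prior, predict_proba) and the exact form of the score are illustrative assumptions, not the paper's definitions.

    import numpy as np

    def model_posterior(models, X, y):
        # p(m | D) proportional to p(D | m) p(m), with p(D | m) taken from each
        # candidate model's marginal likelihood (hypothetical interface).
        log_post = np.array([m.log_marginal_likelihood(X, y) + np.log(m.prior)
                             for m in models])
        log_post -= log_post.max()          # stabilize before exponentiating
        post = np.exp(log_post)
        return post / post.sum()

    def expected_cross_entropy_score(x, labels, models, X, y):
        # Score a candidate query x by the cross entropy between the current
        # distribution over models ("prior") and the posterior expected after
        # hypothetically observing x with each possible label.
        current = model_posterior(models, X, y)
        expected_posterior = np.zeros_like(current)
        for y_new in labels:
            # Predictive probability of outcome y_new under the model average
            # (predict_proba is an assumed, illustrative method signature).
            p_y = sum(w * m.predict_proba(x, y_new, X, y)
                      for w, m in zip(current, models))
            # Hypothetical posterior after adding (x, y_new) to the data.
            X_aug = np.vstack([X, x[None, :]])
            y_aug = np.append(y, y_new)
            expected_posterior += p_y * model_posterior(models, X_aug, y_aug)
        # Divergence between the current distribution and the expected posterior.
        return -np.sum(current * np.log(expected_posterior + 1e-12))

    # The next query would be the pool point maximizing this score, e.g.:
    # x_star = max(pool, key=lambda x: expected_cross_entropy_score(x, labels, models, X, y))

Under these assumptions the active learner repeatedly evaluates the score over the unlabeled pool, queries the maximizer, and updates the distribution over candidate hyperparameter settings with the new observation.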
