Bayesian Active Learning for Maximal Information Gain on Model Parameters

The fact that machine learning models, despite their advances, are still largely trained on randomly gathered data suggests that a lasting solution to the problem of optimal data gathering has not yet been found. In this paper, we investigate whether a Bayesian approach to the classification problem can provide assumptions under which one is guaranteed to perform at least as well as random sampling. For a logistic regression model, we show that maximal expected information gain on the model parameters is a promising criterion for selecting samples, provided that the classification model is well matched to the data. The derived criterion is closely related to maximum model change. We experiment with data sets that satisfy this assumption to varying degrees in order to see how sensitive performance is to violations of this assumption in practice.
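
To make the selection criterion concrete, the following is a minimal sketch of expected information gain on the parameters of a Bayesian logistic regression model, assuming a Laplace approximation to the posterior; the function names, the prior precision `alpha`, and the fixed-MAP simplification are illustrative assumptions, not the paper's exact derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_precision(X, w_map, alpha):
    """Precision of the Laplace posterior, i.e. the Hessian of the negative
    log posterior at the MAP estimate for logistic regression:
    A = alpha * I + sum_i p_i (1 - p_i) x_i x_i^T."""
    p = sigmoid(X @ w_map)
    A = alpha * np.eye(X.shape[1])
    A += (X * (p * (1.0 - p))[:, None]).T @ X
    return A

def expected_information_gain(X_pool, w_map, A):
    """Approximate expected information gain on the parameters for each
    candidate x, via the entropy reduction of the Gaussian posterior and
    the matrix determinant lemma:
        EIG(x) ~= 0.5 * log(1 + p (1 - p) * x^T A^{-1} x).
    For logistic regression the Hessian contribution of a new point does not
    depend on its label, so no expectation over y is needed; the shift of
    the MAP estimate itself is neglected in this sketch."""
    A_inv = np.linalg.inv(A)
    p = sigmoid(X_pool @ w_map)
    leverage = np.einsum('ij,jk,ik->i', X_pool, A_inv, X_pool)
    return 0.5 * np.log1p(p * (1.0 - p) * leverage)

# Query the unlabelled point with the largest expected information gain.
# X_lab: labelled data; X_pool: candidate pool; w_map: any MAP fit
# (e.g. regularized logistic regression with strength alpha).
# A = laplace_precision(X_lab, w_map, alpha)
# query_idx = np.argmax(expected_information_gain(X_pool, w_map, A))
```

Under these assumptions the criterion reduces to a label-independent leverage score weighted by the predictive variance factor p(1 - p), which is one way to see its close relation to maximum model change: points with large p(1 - p) x^T A^{-1} x are exactly those whose inclusion would most perturb the parameter estimate.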
