Sequential Information Maximization: When is Greedy Near-optimal?

Optimal information gathering is a central challenge in machine learning and in science more broadly. A common objective that quantifies the usefulness of observations is Shannon's mutual information, defined with respect to a probabilistic model. Greedily selecting the observations that maximize this mutual information is the method of choice in numerous applications, ranging from Bayesian experimental design and automated diagnosis to active learning in Bayesian models. Despite its importance and widespread use, little is known about the theoretical properties of sequential information maximization, in particular under noisy observations. In this paper, we analyze the widely used greedy policy for this task and identify problem instances on which it provides provably near-maximal utility, even in the challenging setting of persistent noise. Our results depend on a natural separability condition associated with the channel that injects noise into the observations. We also exhibit examples showing that this separability parameter is necessary in the bound: if it is too small, the greedy policy fails to select informative tests.
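To make the greedy policy concrete, below is a minimal sketch of greedy mutual-information maximization for a discrete Bayesian test-selection problem; it is an illustration of the general technique, not the paper's algorithm or analysis. The names greedy_info_max, prior, likelihoods, and budget are illustrative assumptions: the per-test channels P(X_t | Y) encode the observation noise, and since persistent noise means a repeated test returns the same answer, each test is selected at most once.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def greedy_info_max(prior, likelihoods, budget, seed=0):
    """Greedily select tests maximizing I(Y; X_t) under the current posterior.

    prior       : (n_hyp,) prior over hypotheses Y.
    likelihoods : (n_tests, n_hyp, n_outcomes) array of channels P(X_t = x | Y = y);
                  the noise enters through these channels.
    budget      : number of tests to select.

    Outcomes are simulated here only to update the posterior; a real run
    would observe them from the environment.
    """
    rng = np.random.default_rng(seed)
    true_y = rng.choice(len(prior), p=prior)  # hidden hypothesis, for simulation
    posterior = prior.copy()
    selected = []
    for _ in range(budget):
        best_t, best_gain = None, -np.inf
        for t in range(likelihoods.shape[0]):
            if t in selected:  # persistent noise: repeating a test is uninformative
                continue
            p_x = posterior @ likelihoods[t]  # P(X_t = x) under current posterior
            # I(Y; X_t) = H(X_t) - sum_y P(y) H(X_t | Y = y)
            cond_h = np.array([entropy(likelihoods[t, y]) for y in range(len(prior))])
            gain = entropy(p_x) - posterior @ cond_h
            if gain > best_gain:
                best_t, best_gain = t, gain
        selected.append(best_t)
        # Observe a (simulated) outcome and perform a Bayesian posterior update.
        x = rng.choice(likelihoods.shape[2], p=likelihoods[best_t, true_y])
        posterior = posterior * likelihoods[best_t, :, x]
        posterior /= posterior.sum()
    return selected

if __name__ == "__main__":
    # Toy example: 3 hypotheses, 4 binary tests with randomly drawn noisy channels.
    rng = np.random.default_rng(1)
    prior = np.full(3, 1.0 / 3.0)
    likelihoods = rng.dirichlet(np.ones(2), size=(4, 3))  # (tests, hyps, outcomes)
    print(greedy_info_max(prior, likelihoods, budget=2))
```

The `if t in selected` guard reflects the persistent-noise setting discussed above: a repeated test returns the same, possibly corrupted, answer and thus yields no additional information, which is what distinguishes this regime from one where noise can be averaged out by resampling.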
