Infomax Strategies for an Optimal Balance Between Exploration and Exploitation

A proper balance between exploitation and exploration is essential for decisions that achieve high reward, such as payoff or evolutionary fitness. The Infomax principle postulates that maximization of information directs the function of diverse systems, from living systems to artificial neural networks. Although specific applications have been successful, the validity of information as a proxy for reward remains unclear. Here, we consider the multi-armed bandit decision problem, which features arms (slot machines) with unknown probabilities of success and a player who tries to maximize cumulative payoff by choosing the sequence of arms to play. We show that an Infomax strategy (Info-p), which optimally gathers information on the highest probability of success among the arms, saturates known optimal bounds and compares favorably to existing policies. Conversely, gathering information on the identity of the best arm in the bandit leads to a strategy that is vastly suboptimal in terms of payoff. The nature of the quantity selected for Infomax acquisition is therefore crucial for effective tradeoffs between exploration and exploitation.
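To make the bandit setting concrete, the following is a minimal Python sketch of a Bernoulli multi-armed bandit and of the cumulative regret used to score a policy. It does not implement Info-p, whose definition is not given in the abstract. The function names (`bernoulli_kl`, `lai_robbins_constant`, `play`, `uniform_policy`) are hypothetical, and reading the "known optimal bounds" as the Lai-Robbins asymptotic lower bound on regret is an assumption made here for illustration.

```python
import math
import random


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q); assumes 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_constant(probs):
    """Coefficient of log(T) in the Lai-Robbins lower bound on expected regret.

    Assumption: this is the bound the abstract refers to as "known optimal bounds".
    """
    best = max(probs)
    return sum((best - p) / bernoulli_kl(p, best) for p in probs if p < best)


def play(probs, choose_arm, horizon, seed=0):
    """Run a policy for `horizon` plays and return its cumulative expected regret."""
    rng = random.Random(seed)
    wins = [0] * len(probs)   # successes observed on each arm
    pulls = [0] * len(probs)  # number of times each arm was played
    best = max(probs)
    regret = 0.0
    for _ in range(horizon):
        arm = choose_arm(wins, pulls, rng)
        reward = 1 if rng.random() < probs[arm] else 0
        wins[arm] += reward
        pulls[arm] += 1
        regret += best - probs[arm]  # expected payoff lost on this play
    return regret


def uniform_policy(wins, pulls, rng):
    """Baseline that ignores all observations and explores uniformly at random."""
    return rng.randrange(len(pulls))


if __name__ == "__main__":
    probs = [0.9, 0.8, 0.5]  # hypothetical success probabilities, unknown to the player
    horizon = 10_000
    print("uniform regret:", play(probs, uniform_policy, horizon))
    print("Lai-Robbins scale:", lai_robbins_constant(probs) * math.log(horizon))
```

A policy that only explores, like `uniform_policy`, accumulates regret linearly in the number of plays, whereas the bound scales logarithmically. Any strategy that balances exploration and exploitation can be passed as `choose_arm` and compared against that scale.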
