The bandit problem involves two factors: exploration, the collection of information about the environment, and exploitation, the gaining of benefit by choosing the optimal action in an uncertain environment. Exploitation requires choosing only the optimal actions, whereas exploration requires trying a variety of (non-optimal) actions. Hence, to obtain the maximal cumulative gain, we must strike a balance between exploration and exploitation. We treat a situation in which our actions change the structure of the environment; a simple example is formulated as the lob-pass problem by Abe and Takeuchi. In the bandit problem, the environment is usually specified by a finite number of unknown parameters, so the information-collection part amounts to estimating their true values. The present paper treats a more realistic situation: nonparametric estimation of the environment structure, which involves an infinite number (a functional degree of freedom) of unknown parameters. An asymptotically optimal strategy is given for this setting, and we prove that the cumulative loss can be made of order O(t^α), where α > 0 is an arbitrarily small constant and t is the number of trials, in contrast with the optimal order O(log t) in the parametric case.

Index Terms: bandit problem, stochastic game, optimal strategy, nonparametric estimation, stochastic approximation.

K. Hiraoka is with the Department of Information Engineering, University of Tokyo, Tokyo 113, Japan. S. Amari is with the Department of Information Engineering, University of Tokyo, Tokyo 113, Japan. He is also with the RIKEN Frontier Research Program on Brain Information Processing, RIKEN, Wako, Japan.
[1] K. Chung. On a Stochastic Approximation Method, 1954.
[2] M. T. Wasan. Stochastic Approximation, 1969.
[3] Harold J. Kushner, et al. Stochastic approximation methods for constrained and unconstrained systems, 1978.
[4] K. Glazebrook. Optimal strategies for families of alternative bandit processes, 1983.
[5] P. Bickel. Efficient and Adaptive Estimation for Semiparametric Models, 1993.
[6] Naoki Abe, et al. The “lob-pass” problem and an on-line learning model of rational choice, 1993, COLT '93.
[7] Motoaki Kawanabe, et al. Estimation of Network Parameters in Semiparametric Stochastic Perceptron, 1994, Neural Computation.
[8] Barak A. Pearlmutter, et al. Playing the matching-shoulders lob-pass game with logarithmic regret, 1994, COLT '94.
[9] S. Amari, et al. Information geometry of estimating functions in semi-parametric statistical models, 1997.