Behavioral models of strategies in multi-armed bandit problems
In multi-armed bandit problems, agents must repeatedly choose among uncertain alternatives whose true values they can learn about only through experimentation. Information acquired from experimentation is valuable because it tells the agent whether to select a particular option again in the future. Economically significant applications include brand choice, natural resource exploration, research and development and, as special cases, job and price search.
Despite the importance of these applications, little is known about whether firms and individuals appreciate the value of information in bandit problems. What is known comes from laboratory and field studies of search problems. These studies suggest that people do not search enough, perhaps because of search cost or risk aversion. This thesis attempts to ascertain whether this undervaluation of information extends to the more general bandit environment and, if so, whether the suboptimality is attributable to search cost, risk aversion, or some other cause.
The results of three laboratory experiments, each addressing a separate family of putative explanations for undervaluation of information in bandits, are presented. The first asks subjects to choose among a set of uncertain alternatives, controlling for mean-conditional risk and search cost. Although subjects appreciate that there is value to information, they experiment less than the optimal amount. Since there is no experimentation cost and mean-conditional risk is held constant, neither search cost nor risk aversion can be the primary cause of underexperimentation.
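To make the notion of "value to information" concrete, the following is a minimal sketch and not the experimental design used in the thesis. It assumes a stylized bandit with one risky arm paying 1 with an unknown probability (Beta prior, a hypothetical choice) and one safe arm with a known payoff, played over a finite horizon without discounting. The function names and parameters are illustrative. The sketch compares an optimal agent with a myopic agent who always takes the arm with the higher current expected payoff.

```python
from functools import lru_cache


def optimal_value(a, b, safe, horizon):
    """Expected total payoff under optimal play (backward induction).

    a, b    -- Beta parameters describing beliefs about the risky arm (assumed prior)
    safe    -- known per-period payoff of the safe arm
    horizon -- number of periods remaining
    """
    @lru_cache(maxsize=None)
    def v(a_, b_, t):
        if t == 0:
            return 0.0
        p = a_ / (a_ + b_)  # current expected payoff of the risky arm
        # Pulling the risky arm pays off now and also updates beliefs.
        risky = p * (1.0 + v(a_ + 1, b_, t - 1)) + (1.0 - p) * v(a_, b_ + 1, t - 1)
        # Pulling the safe arm pays `safe` and leaves beliefs unchanged.
        stay = safe + v(a_, b_, t - 1)
        return max(risky, stay)

    return v(a, b, horizon)


def myopic_value(a, b, safe, horizon):
    """Expected total payoff of an agent who ignores the value of information."""
    @lru_cache(maxsize=None)
    def v(a_, b_, t):
        if t == 0:
            return 0.0
        p = a_ / (a_ + b_)
        if p >= safe:
            return p * (1.0 + v(a_ + 1, b_, t - 1)) + (1.0 - p) * v(a_, b_ + 1, t - 1)
        return safe + v(a_, b_, t - 1)

    return v(a, b, horizon)


if __name__ == "__main__":
    # Risky arm has prior mean 0.5; safe arm pays 0.55 for certain.
    print(myopic_value(1, 1, 0.55, 10))   # myopic agent never tries the risky arm
    print(optimal_value(1, 1, 0.55, 10))  # optimal agent experiments and earns more
```

In this sketch the myopic agent never experiments because the safe payoff exceeds the risky arm's prior mean, while the optimal agent tries the risky arm at least once. The gap between the two values is the value of the information that experimentation yields under these assumptions.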
The second experiment uses a more powerful design, asking subjects to report their Gittins indexes rather than just make a choice. This additional information is used to test whether agents are hyperbolic discounters who do not experiment enough because they are disproportionately tempted to maximize their current payoff at the expense of future payoffs. This, too, does not appear to be a primary explanation for underexperimentation because the agents' level of present bias changes over time, contrary to an assumption of the model.
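For readers unfamiliar with the index, the sketch below illustrates one common way to think about it: as the certain per-period payoff that makes an agent indifferent between experimenting with an uncertain arm and never trying it. This is an illustrative finite-horizon, undiscounted analogue of the Gittins index under an assumed Bernoulli arm with a Beta prior; it is not the elicitation procedure used in the experiment, and `calibrated_index` and its parameters are hypothetical names.

```python
from functools import lru_cache


def calibrated_index(a, b, horizon, tol=1e-6):
    """Smallest safe per-period payoff at which never experimenting is optimal.

    a, b    -- Beta parameters describing beliefs about the risky arm (assumed prior)
    horizon -- number of periods remaining (finite, no discounting: an assumption)
    """
    def advantage(safe):
        # Value of optimal play minus the value of always taking the safe arm.
        @lru_cache(maxsize=None)
        def v(a_, b_, t):
            if t == 0:
                return 0.0
            p = a_ / (a_ + b_)
            risky = p * (1.0 + v(a_ + 1, b_, t - 1)) + (1.0 - p) * v(a_, b_ + 1, t - 1)
            return max(safe + v(a_, b_, t - 1), risky)

        return v(a, b, horizon) - safe * horizon

    lo, hi = 0.0, 1.0  # payoffs of the risky arm lie in [0, 1]
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if advantage(mid) > 1e-12:
            lo = mid  # experimenting is still strictly worthwhile: index lies above mid
        else:
            hi = mid  # always-safe is already optimal: index lies at or below mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    for T in (1, 2, 10):
        print(T, round(calibrated_index(1, 1, T), 4))
```

With one period left the index equals the arm's expected payoff, since nothing learned can be used later. With longer horizons the index rises above the mean, and that premium reflects the value of information. A reported index below the optimal one therefore corresponds to undervaluing experimentation in this stylized setting.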
The third experiment tests whether ambiguity aversion, or distaste for variance in the distribution from which the means of the payoff distributions are drawn, contributes to undervaluation of information. Consistent with a prediction of ambiguity aversion, subjects have both lower-than-optimal Gittins indexes and higher-than-optimal willingness to pay for information about the true values of ambiguous alternatives. These results are not consistent with hyperbolic discounting, risk aversion or quantal response behavior. However, the errors vary only with changes in the bandit's horizon, not with small changes in mean and variance as ambiguity aversion predicts.
Taken together, these experiments suggest that ambiguity aversion is a likely cause of suboptimal play in bandits, as are the cognitive shortcuts used in formulating and solving the dynamic programming problem. If these errors can be demonstrated across a wide enough set of bandits, in the field as well as in the laboratory, then policies can be developed based on this behavioral understanding of choice. These policies can improve the welfare of the workers, shoppers and firms who have to solve bandit problems.