Multi-armed bandits in the presence of side observations in social networks

We consider the decision problem of an external agent choosing to execute one of M actions for each user in a social network. We assume that observing a user's actions provides valuable information for a larger set of users, since each user's preferences are interrelated with those of her social peers. This falls into the well-known setting of the multi-armed bandit (MAB) problem, but with the critical new component of side observations resulting from interactions between users. Our contributions in this work are as follows: 1) We model the MAB problem in the presence of side observations and obtain an asymptotic lower bound (as a function of the network structure) on the regret (loss) of any uniformly good policy that achieves the maximum long-term average reward. 2) We propose a randomized policy that explores actions for each user at a rate that is a function of her network position. We show that this policy achieves the asymptotic lower bound on regret associated with actions that are unpopular for all users. 3) We derive an upper bound on the regret of existing Upper Confidence Bound (UCB) policies for MAB problems, modified for our setting of side observations. We present case studies showing that these UCB policies are agnostic to the network structure, which causes their regret to suffer in a network setting. Our investigations in this work reveal the significant gains that can be obtained even through static network-aware policies.
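To make the side-observation setting concrete, the following is a minimal illustrative sketch (not the paper's exact index policy or its network-aware randomized policy): a UCB1-style rule is run per user, and playing an action for one user additionally reveals that action's reward for each of the user's neighbors in the social graph. The adjacency list, reward function, round-robin user schedule, and all parameter names here are assumptions made for illustration only.

```python
import math
import random

def ucb_with_side_observations(adjacency, reward_fn, num_actions, horizon, rng):
    """Illustrative UCB1-style policy with side observations: executing an
    action for one user also yields an observation of that action's reward
    for each of the user's neighbors. This is a sketch, not the paper's
    exact policy."""
    n = len(adjacency)
    counts = [[0] * num_actions for _ in range(n)]   # observations per (user, action)
    means = [[0.0] * num_actions for _ in range(n)]  # empirical mean rewards

    def update(user, action, reward):
        # Incremental update of the empirical mean for (user, action).
        counts[user][action] += 1
        means[user][action] += (reward - means[user][action]) / counts[user][action]

    for t in range(horizon):
        user = t % n  # serve users round-robin (an assumption of this sketch)
        untried = [a for a in range(num_actions) if counts[user][a] == 0]
        if untried:
            action = untried[0]  # play each action at least once
        else:
            total = sum(counts[user])
            action = max(
                range(num_actions),
                key=lambda a: means[user][a]
                + math.sqrt(2 * math.log(total) / counts[user][a]),
            )
        # Direct observation for the served user ...
        update(user, action, reward_fn(user, action, rng))
        # ... plus a side observation for each social neighbor.
        for nb in adjacency[user]:
            update(nb, action, reward_fn(nb, action, rng))
    return counts, means
```

Note that in this sketch a well-connected user accumulates observations far faster than her number of direct plays, which is the mechanism behind the network-position-dependent exploration rates discussed above.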