Contextual-Bandit-Based MIMO Relay Selection Policy with Channel Uncertainty

In this work, we exploit the potential benefits of the multi-armed bandit framework in cooperative multiple-input multiple-output (MIMO) wireless networks. In particular, we consider an online policy for amplify-and-forward MIMO relay selection (RS), where relays are provided with uncertain channel state information (CSI). We design the RS policy as a sequential, experience-driven learning algorithm following a contextual bandit (CB) approach: the algorithm learns to select an optimal relay node using the imperfect CSI, provided as a context vector, together with the rewards observed under the current policy, with the aim of maximizing the cumulative mean reward over time. Further, through extensive simulation results, we demonstrate that the proposed CB-based RS policy achieves superior performance compared to the conventional Gram-Schmidt method.
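To make the CB approach concrete, the following is a minimal Python sketch using LinUCB, a standard contextual-bandit algorithm; it is an illustration of the general technique, not the paper's exact policy. Each relay is treated as an arm, the (noisy) CSI estimate serves as the per-arm context vector, and the reward is a stand-in for an achieved-rate measurement. The class name, the exploration parameter alpha, and the synthetic reward model in the usage example are all assumptions introduced here for illustration.

```python
import numpy as np

class LinUCBRelaySelector:
    """Illustrative LinUCB contextual bandit for relay selection.
    Arms = candidate relays; context = imperfect CSI feature vector."""

    def __init__(self, num_relays, context_dim, alpha=1.0):
        self.alpha = alpha  # exploration strength (assumed hyperparameter)
        # Per-arm ridge-regression statistics: A = I + sum x x^T, b = sum r x
        self.A = [np.eye(context_dim) for _ in range(num_relays)]
        self.b = [np.zeros(context_dim) for _ in range(num_relays)]

    def select(self, contexts):
        """contexts: list of per-relay context vectors built from imperfect CSI."""
        scores = []
        for k, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[k])
            theta = A_inv @ self.b[k]                     # estimated reward model
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # uncertainty bonus
            scores.append(theta @ x + bonus)              # upper confidence bound
        return int(np.argmax(scores))

    def update(self, k, x, reward):
        """Update arm k's statistics with the observed reward."""
        self.A[k] += np.outer(x, x)
        self.b[k] += reward * x

# Example usage with synthetic noisy-CSI contexts (purely illustrative):
rng = np.random.default_rng(0)
selector = LinUCBRelaySelector(num_relays=4, context_dim=8)
for t in range(1000):
    contexts = [rng.standard_normal(8) for _ in range(4)]
    k = selector.select(contexts)
    reward = contexts[k].sum() + 0.1 * rng.standard_normal()  # stand-in reward
    selector.update(k, contexts[k], reward)
```

The uncertainty bonus is what lets the policy cope with imperfect CSI: relays whose reward model is poorly estimated receive a larger bonus and are explored more, so the cumulative mean reward improves as experience accumulates.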