Corrections to "Satisficing in Multiarmed Bandit Problems"
An unfortunate mistake in the proof of Theorem 8 of the above paper is corrected.

We correct an error in the published proof of Theorem 8 of [1]. The error arises from an incorrect application of concentration inequalities. The correction follows the same structure as that published in [2, Appendix G], which corrects the proofs of performance bounds for the UCL algorithms in [3] and thus [1, Theorem 7].

The heuristic value $Q_i^t$ in [1, (27)] is
$$Q_i^t = \mu_i^t + \sigma_i^t \, \Phi^{-1}(1 - \alpha_t). \qquad \text{(C1)}$$
To correct Theorem 8 of [1], set $\alpha_t = 1/(K t^a)$ with $a > 4/(3(1 - \epsilon/16))$, $\epsilon \in (0, 4)$, and $K = \sqrt{2\pi e}$. The last part of the statement of [1, Theorem 8] should be replaced by: "Then, the following statements hold for the satisfaction-in-mean-reward UCL algorithm with uncorrelated uninformative prior and $K = \sqrt{2\pi e}$: 1) the expected number of times a non-satisfying arm $i$ is chosen until time $T$ satisfies $\mathbb{E}[n_i^T] \le \left( 8a/\Delta_i^2 \right) \log T + o(\log T)$; 2) the cumulative expected satisfaction-in-mean-reward regret until time $T$ satisfies …
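For concreteness, the corrected heuristic value (C1) with the tuning $\alpha_t = 1/(K t^a)$ can be sketched as below. This is an illustrative sketch only: the function name and signature are not from [1], and $\Phi^{-1}$ is taken from the Python standard library's `statistics.NormalDist`.

```python
from math import sqrt, pi, e
from statistics import NormalDist

# Hypothetical helper, not from [1]: computes the heuristic value (C1),
#   Q_i^t = mu_i^t + sigma_i^t * Phi^{-1}(1 - alpha_t),
# with the corrected tuning alpha_t = 1 / (K * t^a).
def ucl_index(mu, sigma, t, a, K=sqrt(2 * pi * e)):
    # For t >= 1 and K = sqrt(2*pi*e) ~ 4.13, alpha_t lies in (0, 1),
    # so the inverse normal CDF below is well defined.
    alpha_t = 1.0 / (K * t ** a)
    return mu + sigma * NormalDist().inv_cdf(1.0 - alpha_t)
```

The index grows with the posterior uncertainty `sigma` and, since $\alpha_t$ shrinks in $t$, the confidence quantile $\Phi^{-1}(1-\alpha_t)$ grows with time, which is what drives continued exploration.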
[1] V. Srivastava et al., "Satisficing in multi-armed bandit problems," IEEE Transactions on Automatic Control, 2015.
[3] V. Srivastava et al., "Modeling human decision making in generalized Gaussian multiarmed bandits," Proceedings of the IEEE, 2013.