Differentially Private Regret Minimization in Episodic Markov Decision Processes

We study regret minimization in finite-horizon tabular Markov decision processes (MDPs) under the constraints of differential privacy (DP). This is motivated by the widespread application of reinforcement learning (RL) to real-world sequential decision-making problems, where protecting users' sensitive and private information is becoming paramount. We consider two variants of DP: joint DP (JDP), where a centralized agent is responsible for protecting users' sensitive data, and local DP (LDP), where information must be protected directly on the user side. We first propose two general frameworks, one for policy optimization and one for value iteration, for designing private, optimistic RL algorithms. We then instantiate these frameworks with suitable privacy mechanisms to satisfy JDP and LDP requirements while simultaneously obtaining sublinear regret guarantees. The regret bounds show that under JDP the cost of privacy is only a lower-order additive term, while under the stronger LDP protection the cost is multiplicative. Finally, the regret bounds are obtained by a unified analysis, which, we believe, can be extended beyond tabular MDPs.
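To make the recipe concrete, the following is a minimal sketch, in Python, of one planning step of a JDP-style private optimistic value iteration for a tabular episodic MDP: the agent perturbs its sufficient statistics (visit counts, transition counts, accumulated rewards) with calibrated noise, then runs backward induction with an exploration bonus. The plain Laplace perturbation, the bonus constants, and the name private_optimistic_value_iteration are illustrative assumptions for this sketch, not the paper's actual mechanisms.

```python
import numpy as np

def private_optimistic_value_iteration(counts_sa, counts_sas, rewards_sum,
                                       H, epsilon, rng=None):
    """One planning step of a noisy, optimistic value iteration (a sketch).

    counts_sa:   (S, A) visit counts N(s, a)
    counts_sas:  (S, A, S) transition counts N(s, a, s')
    rewards_sum: (S, A) accumulated rewards, assumed bounded in [0, 1] per step
    H:           episode horizon
    epsilon:     privacy budget; a plain Laplace mechanism stands in for
                 whatever calibrated mechanism the real algorithm would use
    """
    rng = rng or np.random.default_rng()
    S, A = counts_sa.shape

    # Privatize the sufficient statistics (hypothetical noise calibration).
    noisy_sa = counts_sa + rng.laplace(0.0, 1.0 / epsilon, size=(S, A))
    noisy_sas = counts_sas + rng.laplace(0.0, 1.0 / epsilon, size=(S, A, S))
    noisy_r = rewards_sum + rng.laplace(0.0, 1.0 / epsilon, size=(S, A))

    # Build empirical model from the noisy statistics.
    n = np.maximum(noisy_sa, 1.0)                  # avoid division by zero
    p_hat = np.maximum(noisy_sas, 0.0) / n[:, :, None]
    p_hat /= np.maximum(p_hat.sum(axis=2, keepdims=True), 1e-12)  # renormalize
    r_hat = np.clip(noisy_r / n, 0.0, 1.0)
    bonus = H * np.sqrt(1.0 / n)   # optimism; constants and privacy terms elided

    # Backward induction with optimistic Q-values clipped to [0, H].
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = np.clip(r_hat + bonus + p_hat @ V, 0.0, H)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```

Under LDP the same planning step would apply, but each user would perturb their own trajectory statistics before sending them to the agent rather than the agent adding noise once centrally; the resulting noise accumulates across users, which is one intuition for why the LDP cost of privacy appears as a multiplicative rather than additive term in the regret.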
