Differentially Private Regret Minimization in Episodic Markov Decision Processes

We study regret minimization in finite-horizon tabular Markov decision processes (MDPs) under the constraints of differential privacy (DP). This is motivated by the widespread application of reinforcement learning (RL) to real-world sequential decision-making problems, where protecting users' sensitive and private information is becoming paramount. We consider two variants of DP: joint DP (JDP), where a centralized agent is responsible for protecting users' sensitive data, and local DP (LDP), where information must be protected directly on the user side. We first propose two general frameworks, one for policy optimization and one for value iteration, for designing private, optimistic RL algorithms. We then instantiate these frameworks with suitable privacy mechanisms to satisfy JDP and LDP requirements while simultaneously obtaining sublinear regret guarantees. The regret bounds show that under JDP the cost of privacy is only a lower-order additive term, while under the stronger LDP protection the cost is multiplicative. Finally, the regret bounds are obtained by a unified analysis, which, we believe, can be extended beyond tabular MDPs.
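To make the recipe concrete, the following is a minimal sketch, in Python, of one planning step of a JDP-style private optimistic value iteration for a tabular episodic MDP: the agent perturbs its sufficient statistics (visit counts, transition counts, accumulated rewards) with calibrated noise, then runs backward induction with an exploration bonus. The plain Laplace perturbation, the bonus constants, and the name private_optimistic_value_iteration are illustrative assumptions for this sketch, not the paper's actual mechanisms.

```python
import numpy as np

def private_optimistic_value_iteration(counts_sa, counts_sas, rewards_sum,
                                       H, epsilon, rng=None):
    """One planning step of a noisy, optimistic value iteration (a sketch).

    counts_sa:   (S, A) visit counts N(s, a)
    counts_sas:  (S, A, S) transition counts N(s, a, s')
    rewards_sum: (S, A) accumulated rewards, assumed bounded in [0, 1] per step
    H:           episode horizon
    epsilon:     privacy budget; a plain Laplace mechanism stands in for
                 whatever calibrated mechanism the real algorithm would use
    """
    rng = rng or np.random.default_rng()
    S, A = counts_sa.shape

    # Privatize the sufficient statistics (hypothetical noise calibration).
    noisy_sa = counts_sa + rng.laplace(0.0, 1.0 / epsilon, size=(S, A))
    noisy_sas = counts_sas + rng.laplace(0.0, 1.0 / epsilon, size=(S, A, S))
    noisy_r = rewards_sum + rng.laplace(0.0, 1.0 / epsilon, size=(S, A))

    # Build empirical model from the noisy statistics.
    n = np.maximum(noisy_sa, 1.0)                  # avoid division by zero
    p_hat = np.maximum(noisy_sas, 0.0) / n[:, :, None]
    p_hat /= np.maximum(p_hat.sum(axis=2, keepdims=True), 1e-12)  # renormalize
    r_hat = np.clip(noisy_r / n, 0.0, 1.0)
    bonus = H * np.sqrt(1.0 / n)   # optimism; constants and privacy terms elided

    # Backward induction with optimistic Q-values clipped to [0, H].
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = np.clip(r_hat + bonus + p_hat @ V, 0.0, H)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```

Under LDP the same planning step would apply, but each user would perturb their own trajectory statistics before sending them to the agent rather than the agent adding noise once centrally; the resulting noise accumulates across users, which is one intuition for why the LDP cost of privacy appears as a multiplicative rather than additive term in the regret.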
