Privacy-Preserving Reinforcement Learning over Distributed Datasets

As the Internet of Things (IoT) becomes increasingly popular, the opportunity arises to exchange data within a group, or fleet, of similar devices. This allows institutions to share their data about a specific control task in order to boost the learning process. However, such data sets are often confidential and cannot be shared in their raw form. We propose a privacy-preserving reinforcement learning technique that allows knowledge transfer among similar agents.

Our starting point is the setting of fleet reinforcement learning (RL), in which similar agents (the fleet) need to solve a similar task. The objective of each agent is to learn a control policy that maximizes the expected cumulative reward. Before the learning process, agents are allowed to share data. However, because agents in reality exhibit small discrepancies (e.g., degradation or production errors), not all data samples are representative for a particular agent. Therefore, knowledge should be transferred only when it is relevant.

To construct a privacy-preserving (PP) fleet RL method, we enhance the fleet Gaussian process reinforcement learning (FGPRL) method proposed in [1] with a secure multi-party computation (SMPC) protocol. In FGPRL, the goal is to estimate the transition function, from which the optimal policy can be inferred. This is done using a Gaussian process (GP). A GP is a collection of indexed random variables, any finite number of which have a joint Gaussian distribution. In a regression context, these random variables are the outputs of an unknown function, and their indices are the inputs to that function. The most important parameter of a GP is its covariance kernel, which describes how these outputs are correlated, i.e., how knowledge about one output gives information about another. Coregionalization extends this idea to the outputs of multiple agents. A coregionalized GP is used in FGPRL as the joint transition model, so that the transitions of different fleet members can be correlated and data can be shared effectively based on the similarities between the members.

To perform prediction at new input points $X_*$, the GP is conditioned on the fleet's training data. The following formulas can be used to obtain the posterior statistics of $f_* := f(X_*)$:

\mathbb{E}[f_*] = K(X_*, X)\, K(X, X)^{-1}\, y \qquad (1)
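
As an illustration of Eq. (1), the following minimal sketch computes the GP posterior mean with plain NumPy. The squared-exponential kernel, its hyperparameters, and the jitter (noise) term are illustrative assumptions and are not taken from [1].

```python
# Minimal sketch of GP posterior prediction following Eq. (1).
# The RBF kernel and the jitter term are illustrative choices.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between the rows of A and B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior_mean(X_train, y_train, X_star, noise=1e-6):
    """E[f*] = K(X*, X) K(X, X)^{-1} y, cf. Eq. (1)."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_star, X_train)
    # Solve the linear system instead of forming an explicit inverse.
    alpha = np.linalg.solve(K, y_train)
    return K_star @ alpha

# Toy usage: regress a 1-D function and predict at two new inputs.
X = np.linspace(0.0, 1.0, 20)[:, None]
y = np.sin(2.0 * np.pi * X).ravel()
X_new = np.array([[0.25], [0.75]])
print(gp_posterior_mean(X, y, X_new))
```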
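One common way to build a coregionalized kernel is the intrinsic coregionalization model, in which a base kernel over inputs is scaled by a coregionalization matrix B that encodes how strongly fleet members are correlated. The sketch below illustrates this construction only; the matrix B, the base kernel, and the toy data are assumptions for illustration, not quantities estimated in [1].

```python
# Illustrative intrinsic-coregionalization kernel over fleet members:
# K((x, i), (x', j)) = B[i, j] * k(x, x'), where B correlates members i and j.
import numpy as np

def rbf(A, B, lengthscale=1.0):
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-0.5 * sq / lengthscale**2)

def coregionalized_kernel(X1, m1, X2, m2, B, lengthscale=1.0):
    """Covariance between (input, member-index) pairs of two data sets."""
    K_input = rbf(X1, X2, lengthscale)   # correlation in input space
    K_member = B[np.ix_(m1, m2)]         # correlation between fleet members
    return K_member * K_input            # element-wise product

# Two fleet members assumed to be correlated with coefficient 0.8 (made-up).
B = np.array([[1.0, 0.8],
              [0.8, 1.0]])
X = np.array([[0.1], [0.2], [0.9]])      # inputs (e.g., state-action pairs)
members = np.array([0, 0, 1])            # which member produced each sample
print(coregionalized_kernel(X, members, X, members, B))
```

With such a joint kernel, samples from a similar member contribute strongly to another member's predictions, while samples from a dissimilar member are effectively down-weighted.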
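This excerpt does not detail the SMPC protocol itself. As a rough illustration of the kind of primitive involved, the sketch below shows additive secret sharing, a standard SMPC building block: fleet members jointly compute an aggregate statistic without any member revealing its raw data. This is a generic example, not the protocol combined with FGPRL in our method, and practical protocols operate over a finite field rather than with real-valued masks.

```python
# Generic additive secret sharing sketch (NOT the specific protocol from [1]).
# Three fleet members jointly compute the sum of private statistics without
# revealing any individual contribution. Real protocols use finite fields;
# real-valued masks are used here only for readability.
import numpy as np

rng = np.random.default_rng(0)

def share(secret, n_parties):
    """Split a secret vector into n_parties additive shares that sum to it."""
    masks = [rng.normal(size=secret.shape) for _ in range(n_parties - 1)]
    return masks + [secret - sum(masks)]

# Each member holds a private local statistic, e.g. a sufficient statistic
# of its own transition data (made-up values).
private = [rng.normal(size=4) for _ in range(3)]

# Every member splits its statistic and sends one share to each party.
all_shares = [share(p, 3) for p in private]

# Each party sums the shares it received and publishes only that partial sum.
partial_sums = [sum(all_shares[m][p] for m in range(3)) for p in range(3)]

# Recombining the partial sums reveals the fleet-wide aggregate,
# but no individual member's statistic.
aggregate = sum(partial_sums)
assert np.allclose(aggregate, sum(private))
print(aggregate)
```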