An Offline Reinforcement Learning Method for Virtual Reality Satellite Attitude Control Based on a Generative Adversarial Network

Virtual reality satellites give people an immersive experience of exploring space. Intelligent attitude control based on reinforcement learning, which achieves multi-axis synchronous control, is one of the key tasks of a virtual reality satellite. In real-world systems, reinforcement-learning-based methods face safety issues during exploration, unknown actuator delays, and noise in the raw sensor data. To improve sample efficiency and avoid safety issues during exploration, this paper proposes a new offline reinforcement learning method that makes full use of the collected samples. The method learns a set of policies with imitation learning and a policy selector with a generative adversarial network (GAN). The performance of the proposed method was verified on a real-world system (a reaction-wheel-based inverted pendulum). The results showed that the agent trained with our method reached and maintained a stable goal state over 10,000 steps, whereas the behavior cloning baseline remained stable for only 500 steps.
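The abstract does not spell out how the GAN-based selector interacts with the imitation-learned policy set, so the following is only a minimal Python/PyTorch sketch of one plausible reading: a discriminator trained on the offline (state, action) dataset scores how "in-distribution" each candidate policy's proposed action is, and the selector picks the highest-scoring policy at each step. All names (`Discriminator`, `select_action`, `policies`) are hypothetical, not the authors' code.

```python
# Hypothetical sketch (not the authors' implementation): a GAN discriminator
# scores (state, action) pairs, and a selector queries every imitation-learned
# policy, keeping the action the discriminator rates closest to the offline data.
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """Scores (state, action) pairs in [0, 1]; assumed to be trained
    adversarially so that high scores mean the pair resembles the
    offline demonstration dataset."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))


def select_action(state, policies, discriminator):
    """Policy selector for a single state: query each policy in the
    imitation-learned set and return the action the discriminator judges
    most consistent with the offline data distribution."""
    with torch.no_grad():
        actions = [policy(state) for policy in policies]
        scores = torch.stack([discriminator(state, a) for a in actions])
        best = torch.argmax(scores)  # index of the highest-scoring policy
    return actions[best]
```

Under this reading, the discriminator acts as an out-of-distribution filter: instead of executing one cloned policy blindly (plain behavior cloning), the agent only ever takes actions the GAN considers well supported by the offline samples, which is consistent with the paper's stated goal of avoiding unsafe exploration.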
