Utility Decomposition with Deep Corrections for Scalable Planning under Uncertainty

Decomposition methods have been proposed to approximate solutions to large sequential decision-making problems. In settings where an agent interacts with multiple entities, utility decomposition treats each entity independently and combines the individual utility functions in real time to solve the global problem. Although these techniques can perform well empirically, they sacrifice optimality. This paper proposes an approach, inspired by multi-fidelity optimization, that learns a correction term represented by a neural network. Learning this correction can significantly improve performance. We demonstrate the approach on a pedestrian avoidance problem for autonomous driving: by leveraging a strategy for avoiding a single pedestrian, the decomposition method scales to avoiding multiple pedestrians. We verify empirically that the proposed correction leads to a significant improvement over the decomposition method alone and outperforms a policy trained on the full-scale problem without utility decomposition.
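The core idea can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: `q_single` is a hypothetical stand-in for a utility function trained on the single-entity problem, the per-entity utilities are fused pessimistically with a min over entities, and an optional `correction` callable plays the role of the learned correction network.

```python
import numpy as np

# Hypothetical single-entity utility Q(s, e, a): in the paper's setting this
# would be a network trained on the single-pedestrian problem; here it is a
# toy stand-in where closer entities make every action look worse.
def q_single(agent_state, entity_state, n_actions=3):
    dist = np.linalg.norm(np.asarray(agent_state) - np.asarray(entity_state))
    return np.array([-1.0 / (dist + 1e-3) + 0.1 * a for a in range(n_actions)])

def decomposed_q(agent_state, entities, correction=None):
    """Fuse per-entity utilities (min over entities = worst-case entity),
    then add an optional learned correction term."""
    qs = np.stack([q_single(agent_state, e) for e in entities])
    fused = qs.min(axis=0)
    if correction is not None:
        fused = fused + correction(agent_state, entities)
    return fused

def act(agent_state, entities, correction=None):
    # Greedy action with respect to the corrected, decomposed utility.
    return int(np.argmax(decomposed_q(agent_state, entities, correction)))
```

In the corrected method, `correction` would be a network trained on the full multi-entity problem to compensate for the suboptimality of the fusion rule; with `correction=None` this reduces to plain utility decomposition.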
