Hierarchical reinforcement learning captures sub-task information to learn modular policies that can be quickly adapted to new tasks. While hierarchies can be learned jointly with policies, doing so requires extensive environment interaction. Traditional approaches need less data but typically rely on sub-task labels to build a task hierarchy. We propose a semi-supervised constrained clustering approach that alleviates both the labeling and interaction requirements. Our approach combines limited supervision with an arbitrary set of weak constraints, obtained purely from observations, which are jointly optimized to produce a clustering of states into sub-tasks. We demonstrate improvements on two visual reinforcement learning tasks.

Sequential decision-making problems are an important domain within artificial intelligence. Complex tasks are naturally modeled hierarchically, as a set of sub-tasks that capture sub-goals. Hierarchical methods have a rich history in artificial intelligence (Sutton et al., 1999; Mehta et al., 2008; Vezhnevets et al., 2017; Banihashemi et al., 2018), and approaches based on state abstraction yield more generalizable, efficient solutions. While a variety of approaches use state abstraction (Bacon et al., 2017; Florensa et al., 2018; Murali et al., 2016; Hamidi et al., 2015), they rely on strong assumptions. For example, several methods use some (potentially weak) sub-task labels in order to model the decomposition. Other methods search for specific classes of sub-tasks or sub-task separators (e.g., bottlenecks (McGovern & Barto, 2001; Mannor et al., 2004; Simsek et al., 2005)), which limits their generality. Recent approaches such as generative models for curriculum learning (Florensa et al., 2018) or hierarchical reinforcement learning, on the other hand, require interaction or simulators that can perform rollouts in the environment.

We instead formulate sub-task discovery as a constrained clustering problem (Wagstaff & Cardie, 2000; Basu et al., 2008), in which limited supervision is combined with an arbitrary set of weak constraints obtained purely from observations and jointly optimized to produce a distinct clustering of states into sub-tasks. These weak constraints are purely unsupervised, assume no sub-task labels, and require no simulators. Specifically, we leverage recent advances in deep learning-based constrained clustering (Hsu & Kira, 2016; Hsu et al., 2019) and show that we can optimize over a set of noisy, weak pairwise constraints between states, i.e., noisy estimates of whether two states are similar or dissimilar and hence belong to the same sub-task or not. While previous work has used temporal information to generate constraints over objects in video (Wu et al., 2013), we explore a more general set of unsupervised constraints that can be learned in decision-making tasks. Specifically, we demonstrate two families of constraints: 1) local constraints that capture temporal information, representing whether sequences of states belong to the same sub-task, and 2) global constraints that capture longer-range similarity, obtained by using policy features as well as a trained autoencoder to compute distances between states.
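To make the clustering objective concrete, the following is a minimal PyTorch sketch of a pairwise-constraint loss in the spirit of Hsu et al. (2019): each state is mapped to a softmax distribution over K candidate sub-task clusters, and a binary cross-entropy on the inner product of two states' distributions pulls must-link pairs into the same cluster and pushes cannot-link pairs apart. The function and tensor names here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_constraint_loss(logits_a, logits_b, similar, eps=1e-7):
    """Pairwise-constraint clustering loss over noisy weak constraints.

    logits_a, logits_b: (N, K) cluster logits for the two states in each pair.
    similar: (N,) float tensor; 1.0 for must-link pairs, 0.0 for cannot-link.
    """
    p_a = F.softmax(logits_a, dim=1)
    p_b = F.softmax(logits_b, dim=1)
    # Inner product = probability that both states land in the same cluster.
    agreement = (p_a * p_b).sum(dim=1).clamp(eps, 1.0 - eps)
    # Binary cross-entropy against the (noisy) constraint labels.
    return F.binary_cross_entropy(agreement, similar)

# Usage (illustrative): `encoder` is any network mapping states to K logits.
# loss = pairwise_constraint_loss(encoder(states_a), encoder(states_b), similar)
```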
We make the following contributions: 1) we propose a general framework that combines weak evidence via constraints in a manner that is both scalable and end-to-end; 2) we define a novel way to learn weak constraints in decision-making tasks that can be generated automatically from observations (see the sketch below); and 3) we demonstrate that the approach works across two complex visual environments.
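As a rough illustration of how such constraints can be generated automatically, the sketch below derives both constraint types from a stored trajectory: local constraints assume that temporally adjacent states usually share a sub-task (and temporally distant states usually do not), while global constraints threshold pairwise distances in a trained autoencoder's latent space. The window size, gap, distance thresholds, and the `autoencoder.encode` interface are hypothetical choices for illustration, not values from the paper.

```python
import torch

def local_constraints(T, window=3, gap=50):
    """Temporal constraints over a trajectory of T states.

    States within `window` steps are treated as must-link (same sub-task);
    states more than `gap` steps apart as cannot-link. Both are noisy,
    weak constraints, not ground truth.
    """
    pairs, labels = [], []
    for i in range(T):
        if i + window < T:
            pairs.append((i, i + window))   # likely the same sub-task
            labels.append(1.0)
        if i + gap < T:
            pairs.append((i, i + gap))      # likely different sub-tasks
            labels.append(0.0)
    return pairs, torch.tensor(labels)

def global_constraints(states, autoencoder, low=0.5, high=2.0):
    """Distance-based constraints in a learned latent space.

    `autoencoder.encode` is assumed to map raw states to latent vectors;
    pairs closer than `low` become must-link, pairs farther than `high`
    become cannot-link, and pairs in between are left unconstrained.
    """
    with torch.no_grad():
        z = autoencoder.encode(states)   # (T, d) latent vectors
    dist = torch.cdist(z, z)             # (T, T) pairwise distances
    pairs, labels = [], []
    T = len(states)
    for i in range(T):
        for j in range(i + 1, T):
            if dist[i, j] < low:
                pairs.append((i, j))
                labels.append(1.0)
            elif dist[i, j] > high:
                pairs.append((i, j))
                labels.append(0.0)
    return pairs, torch.tensor(labels)
```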
[1] Mandana Hamidi et al. Active Imitation Learning of Hierarchical Policies. IJCAI, 2015.
[2] Michał Kempka et al. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. IEEE Conference on Computational Intelligence and Games (CIG), 2016.
[3] Sugato Basu et al. Constrained Clustering: Advances in Algorithms, Theory, and Applications. 2008.
[4] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 1997.
[5] Richard S. Sutton et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 1999.
[6] Kiri Wagstaff and Claire Cardie. Clustering with Instance-Level Constraints. AAAI/IAAI, 2000.
[7] Alexander Sasha Vezhnevets et al. FeUdal Networks for Hierarchical Reinforcement Learning. ICML, 2017.
[8] Pierre-Luc Bacon et al. The Option-Critic Architecture. AAAI, 2017.
[9] Yen-Chang Hsu et al. Multi-class Classification without Multi-class Labels. ICLR, 2019.
[10] Yen-Chang Hsu and Zsolt Kira. Neural network-based clustering using pairwise constraints. arXiv, 2015.
[11] Özgür Şimşek et al. Identifying useful subgoals in reinforcement learning by local graph partitioning. ICML, 2005.
[12] Amy McGovern and Andrew G. Barto. Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density. ICML, 2001.
[13] Bita Banihashemi et al. Hierarchical Agent Supervision. AAMAS, 2018.
[14] Neville Mehta et al. Automatic discovery and transfer of MAXQ hierarchies. ICML, 2008.
[15] Baoyuan Wu et al. Constrained Clustering and Its Application to Face Clustering in Videos. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[16] Shie Mannor et al. Dynamic abstraction in reinforcement learning via clustering. ICML, 2004.
[17] Carlos Florensa et al. Automatic Goal Generation for Reinforcement Learning Agents. ICML, 2018.
[18] Adithyavairavan Murali et al. TSC-DL: Unsupervised trajectory segmentation of multi-modal surgical demonstrations with Deep Learning. IEEE International Conference on Robotics and Automation (ICRA), 2016.