Learning GFlowNets from partial episodes for improved convergence and stability

Generative flow networks (GFlowNets) are a family of algorithms for training a sequential sampler of discrete objects under an unnormalized target density, and they have been used successfully for a variety of probabilistic modeling tasks. Existing training objectives for GFlowNets are either local to states or transitions, or propagate a reward signal over an entire sampling trajectory. We argue that these alternatives represent opposite ends of a gradient bias-variance tradeoff and propose a way to exploit this tradeoff to mitigate its harmful effects. Inspired by the TD($\lambda$) algorithm in reinforcement learning, we introduce subtrajectory balance, or SubTB($\lambda$), a GFlowNet training objective that can learn from partial action subsequences of varying lengths. We show that SubTB($\lambda$) accelerates sampler convergence in previously studied and in new environments, and enables training GFlowNets in environments with longer action sequences and sparser reward landscapes than was previously possible. We also perform a comparative analysis of stochastic gradient dynamics, shedding light on the bias-variance tradeoff in GFlowNet training and the advantages of subtrajectory balance.

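The abstract does not spell out the objective itself, so the following is a minimal sketch of subtrajectory balance as suggested by the description above; the notation ($F$, $P_F$, $P_B$) is assumed here and may differ from the paper's. For a complete trajectory $\tau = (s_0 \rightarrow s_1 \rightarrow \dots \rightarrow s_n)$ with terminal state $s_n$, every subtrajectory $s_i \rightarrow \dots \rightarrow s_j$ is asked to satisfy a balance constraint between a learned state-flow function $F$, a forward policy $P_F$, and a backward policy $P_B$ (with $F(s_n)$ tied to the reward of the terminal object):

$$F(s_i) \prod_{k=i}^{j-1} P_F(s_{k+1} \mid s_k) \;=\; F(s_j) \prod_{k=i}^{j-1} P_B(s_k \mid s_{k+1}).$$

SubTB($\lambda$) then minimizes a $\lambda$-weighted average of the squared log-ratios of these constraints over all subtrajectories of $\tau$:

$$\mathcal{L}_{\mathrm{SubTB}(\lambda)}(\tau) \;=\; \frac{\sum_{0 \le i < j \le n} \lambda^{\,j-i} \left( \log \dfrac{F(s_i) \prod_{k=i}^{j-1} P_F(s_{k+1} \mid s_k)}{F(s_j) \prod_{k=i}^{j-1} P_B(s_k \mid s_{k+1})} \right)^{2}}{\sum_{0 \le i < j \le n} \lambda^{\,j-i}}.$$

Small $\lambda$ emphasizes single-transition (detailed-balance-like) terms, while large $\lambda$ emphasizes the full-trajectory (trajectory-balance) term, mirroring how TD($\lambda$) interpolates between one-step bootstrapping and Monte Carlo returns.

The PyTorch sketch below computes this loss for a single trajectory; all function and argument names are hypothetical, and the quadratic loop over $(i, j)$ is kept for readability rather than speed.

```python
import torch

def subtb_lambda_loss(log_state_flow, log_pf, log_pb, lam=0.9):
    """Hypothetical sketch of a SubTB(lambda) loss for one complete trajectory.

    log_state_flow: shape (n+1,), log F(s_0), ..., log F(s_n), with log F(s_n)
                    tied to the log-reward of the terminal object.
    log_pf:         shape (n,), log P_F(s_{k+1} | s_k) for each transition.
    log_pb:         shape (n,), log P_B(s_k | s_{k+1}) for each transition.
    """
    n = log_pf.shape[0]
    # Prefix sums give the log-probability of any subtrajectory i..j in O(1).
    cum_pf = torch.cat([log_pf.new_zeros(1), torch.cumsum(log_pf, dim=0)])
    cum_pb = torch.cat([log_pb.new_zeros(1), torch.cumsum(log_pb, dim=0)])

    total, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            # Log of the subtrajectory balance ratio for s_i -> ... -> s_j.
            log_ratio = (log_state_flow[i] + (cum_pf[j] - cum_pf[i])
                         - log_state_flow[j] - (cum_pb[j] - cum_pb[i]))
            weight = lam ** (j - i)
            total = total + weight * log_ratio ** 2
            weight_sum = weight_sum + weight
    return total / weight_sum
```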