Population-Based Black-Box Optimization for Biological Sequence Design

The use of black-box optimization for the design of new biological sequences is an emerging research area with potentially revolutionary impact. The cost and latency of wet-lab experiments require methods that find good sequences within a few experimental rounds, each evaluating a large batch of sequences, a setting that off-the-shelf black-box optimization methods are ill-equipped to handle. We find that the performance of existing methods varies drastically across optimization tasks, posing a significant obstacle to real-world applications. To improve robustness, we propose Population-Based Black-Box Optimization (P3BO), which generates batches of sequences by sampling from an ensemble of methods. The number of sequences sampled from any method is proportional to the quality of sequences it previously proposed, allowing P3BO to combine the strengths of individual methods while hedging against their innate brittleness. Adapting the hyper-parameters of each method online using evolutionary optimization (Adaptive-P3BO) further improves performance. Through extensive experiments on in-silico optimization tasks, we show that P3BO outperforms any single method in its population, proposing higher-quality sequences as well as more diverse batches. As such, P3BO and Adaptive-P3BO are a crucial step towards deploying ML to real-world sequence design.
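The proportional allocation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how a method's quality is scored or how fractional shares are rounded, so the function name `allocate_batch`, the small `floor` that keeps every method in play, and the largest-remainder rounding are all assumptions made for illustration.

```python
from typing import Dict


def allocate_batch(scores: Dict[str, float], batch_size: int,
                   floor: float = 1e-6) -> Dict[str, int]:
    """Split a batch among methods in proportion to their quality scores.

    `scores` maps a (hypothetical) method name to a non-negative quality
    score earned by the sequences it proposed earlier. `floor` is an assumed
    smoothing term so that a method with score 0 keeps a nonzero weight.
    """
    # Clip negative scores and add the floor to obtain strictly positive weights.
    weights = {name: max(score, 0.0) + floor for name, score in scores.items()}
    total = sum(weights.values())

    # Ideal (fractional) share of the batch for each method.
    exact = {name: batch_size * w / total for name, w in weights.items()}

    # Round down, then hand the leftover slots to the largest remainders
    # so that the counts sum exactly to batch_size.
    counts = {name: int(share) for name, share in exact.items()}
    leftover = batch_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts
```

For example, `allocate_batch({"evolution": 3.0, "model_based": 1.0}, 8)` gives the first method three times as many slots as the second. In a full optimization loop, each method would then propose its allotted number of sequences, and the scores would be updated after every experimental round.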
