The Synthesis Success Calculator: Predicting the Rapid Synthesis of DNA Fragments with Machine Learning

The synthesis and assembly of long DNA fragments have greatly accelerated synthetic biology and biotechnology research. However, long turnaround times and synthesis failures create unpredictable bottlenecks in the design-build-test-learn cycle. We developed a machine learning model, the Synthesis Success Calculator, that predicts whether a long DNA fragment can be readily synthesized with a short turnaround time and identifies the sequence determinants associated with the synthesis outcome. We trained a random forest classifier using biophysical features and a compiled dataset of 1076 DNA fragment sequences, achieving high predictive performance (F1 score of 0.928 on 251 unseen sequences). Feature importance analysis revealed that repetitive DNA sequences were the most important contributor to synthesis failures. We then applied the Synthesis Success Calculator to large sequence datasets and found that 84.9% of the Escherichia coli MG1655 genome, but only 34.4% of sampled plasmids in NCBI, could be readily synthesized. Overall, the Synthesis Success Calculator can be used on its own to prevent synthesis failures or embedded within optimization algorithms to design large genetic systems that can be rapidly synthesized and assembled.
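
As a rough illustration of two ideas in the abstract, the sketch below computes a simple repeat-content feature for a DNA sequence and the F1 score used to report classifier performance. It is not the authors' implementation: the function `repeat_fraction`, its k-mer length `k`, and the toy labels are assumptions chosen for illustration, and the paper's actual biophysical features are not specified here. Only the F1 definition, 2TP / (2TP + FP + FN), is standard.

```python
from collections import Counter


def repeat_fraction(seq: str, k: int = 8) -> float:
    """Fraction of k-mer positions whose k-mer occurs more than once.

    Hypothetical feature for illustration only; the paper's feature
    definitions may differ.
    """
    seq = seq.upper()
    n_positions = len(seq) - k + 1
    if n_positions <= 0:
        return 0.0  # sequence shorter than k: no k-mers to compare
    kmers = [seq[i:i + k] for i in range(n_positions)]
    counts = Counter(kmers)
    repeated = sum(1 for kmer in kmers if counts[kmer] > 1)
    return repeated / n_positions


def f1_score(y_true, y_pred, positive=1) -> float:
    """F1 = 2*TP / (2*TP + FP + FN) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # A tandem repeat scores high; both values are toy inputs.
    print(repeat_fraction("ACGTACGTACGTACGT", k=4))
    # Toy labels: 1 = readily synthesized, 0 = not (assumed encoding).
    print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))
```

In a fuller pipeline, features like this would form the columns of a feature matrix passed to a random forest classifier, and F1 would be computed on held-out sequences, as the abstract describes for its 251 unseen sequences.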
