A machine learning toolkit for genetic engineering attribution to facilitate biosecurity

The promise of biotechnology is tempered by its potential for accidental or deliberate misuse. Reliably identifying telltale signatures characteristic of different genetic designers, termed ‘genetic engineering attribution’, would deter misuse, yet the problem is still considered unsolved. Here, we show that recurrent neural networks trained on DNA motifs and basic phenotype data can reach 70% attribution accuracy in distinguishing between over 1,300 labs. To make these models usable in practice, we introduce a framework for weighing predictions against other investigative evidence using calibration, and bring our model to within 1.6% of perfect calibration. Additionally, we demonstrate that simple models can accurately predict both the nation-state-of-origin and ancestor labs, forming the foundation of an integrated attribution toolkit that should promote responsible innovation and international security alike.
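
To make the notion of calibration concrete, the sketch below computes expected calibration error (ECE), one common way to measure how far a classifier's stated confidence is from its observed accuracy. It is only an illustration: the abstract does not say which metric underlies the 1.6% figure, so the choice of ECE, the function name `expected_calibration_error`, and the default of 10 equal-width bins are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Hypothetical helper: ECE over equal-width confidence bins.

    confidences: top-1 predicted probabilities, each in [0, 1]
    correct:     booleans, True where the top-1 prediction was right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Group predictions by confidence; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    # Sum the |accuracy - mean confidence| gaps, weighted by bin size.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Ten predictions at 90% confidence, nine of them right: well calibrated.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])

# Five predictions at 80% confidence, all right: underconfident by 20 points.
underconfident = expected_calibration_error([0.8] * 5, [True] * 5)
```

A perfectly calibrated model scores 0 under this measure, so a value near 0.016 would be one possible reading of "within 1.6% of perfect calibration".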
