Deep learning mutation prediction enables early stage lung cancer detection in liquid biopsy

Detecting somatic cancer mutations at ultra-low variant allele frequencies (VAFs) is an unmet challenge that is intractable with current state-of-the-art mutation calling methods. Specifically, because extant methods require multiple supporting reads, their limit of VAF detection is closely tied to the depth of coverage, which precludes detecting mutations at VAFs orders of magnitude below the level that the depth of coverage can support. Nevertheless, the ability to detect cancer-associated mutations at ultra-low VAFs is a fundamental requirement for low-tumor-burden cancer diagnostics applications such as early detection, monitoring, and therapy nomination using liquid biopsy methods (cell-free DNA, cfDNA). Here we define a spatial representation of sequencing information adapted for convolutional architectures that enables variant detection in a manner independent of the depth of sequencing. This method enables the detection of cancer mutations at VAFs as low as 10^-4, more than two orders of magnitude below the current state-of-the-art. We validated our method on both simulated plasma and clinical cfDNA plasma samples from cancer patients and non-cancer controls. This method introduces a new domain within bioinformatics and personalized medicine: somatic whole-genome mutation calling for liquid biopsy.
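
The dependence of multi-read callers on depth of coverage can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes that the number of mutant reads at a site follows a simple binomial sampling model, and the function name, the depth of 35x, and the two-read threshold are hypothetical values chosen only for illustration.

```python
from math import comb


def prob_at_least_k_supporting_reads(depth: int, vaf: float, k: int) -> float:
    """Return P(X >= k) for X ~ Binomial(depth, vaf).

    Assumption (not from the source): each read covering the site carries
    the mutant allele independently with probability `vaf`.
    """
    # Probability of seeing fewer than k mutant reads among `depth` reads.
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1.0 - vaf) ** (depth - i) for i in range(k)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    depth = 35  # hypothetical whole-genome coverage
    min_reads = 2  # hypothetical "multiple supporting reads" threshold
    for vaf in (0.5, 1e-2, 1e-4):
        p = prob_at_least_k_supporting_reads(depth, vaf, min_reads)
        print(f"VAF={vaf:g}: P(at least {min_reads} supporting reads) = {p:.3g}")
```

Under these assumptions, the chance that a given site accumulates enough supporting reads falls off sharply as the VAF drops well below the reciprocal of the depth, which is the regime the abstract describes as out of reach for callers that need multiple supporting reads.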
