Using Machine Learning to Identify True Somatic Variants from Next-Generation Sequencing

Background: Molecular profiling has become essential for tumor risk stratification and treatment selection. However, the complexity of cancer genomes and the presence of technical artifacts make it challenging to identify real variants. Currently, clinical laboratories rely on manual screening, which is costly, subjective, and not scalable. Here we present a machine learning-based method to distinguish artifacts from bona fide single nucleotide variants (SNVs) detected by next-generation sequencing (NGS) of tumor specimens.

Methods: A cohort of 11,278 SNVs identified through clinical sequencing of tumor specimens was collected and divided into training, validation, and test sets. Each SNV had been manually inspected and labeled as either real or artifact as part of the clinical laboratory workflow. A three-class (real, artifact, and uncertain) model was developed on the training set, fine-tuned using the validation set, and then evaluated on the test set. Prediction intervals reflecting the certainty of the classifications were derived during the process and used to label “uncertain” variants.

Results: The optimized classifier demonstrated 100% specificity and 97% sensitivity over the 5,587 SNVs of the test set. Of 1,341 true positive variants, 1,252 were identified as real; of 4,246 false positive calls, 4,143 were deemed artifacts; and only 192 (3.4%) SNVs were labeled “uncertain”. No true positive was misclassified as an artifact, or vice versa, in the test set.

Conclusions: We present a computational classifier to identify variant artifacts detected in tumor sequencing. Overall, 96.6% of the SNVs received a definitive label and thus were exempt from manual review. This framework could improve the quality and efficiency of the variant review process in clinical laboratories.
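The three-way labeling described in the Methods can be illustrated with a minimal sketch. The abstract does not state how prediction intervals were converted into labels, so the rule below is an assumption made for illustration only: a variant is called definitively when its whole interval on the predicted probability of being real lies on one side of a cutoff, and is otherwise left "uncertain". The function name `triage`, the 0.5 cutoff, and the example intervals are hypothetical. The second half of the sketch only re-derives the percentages quoted in the Results from the reported counts.

```python
def triage(lower: float, upper: float, threshold: float = 0.5) -> str:
    """Assign a three-way label from a prediction interval on P(real).

    Assumed rule (not taken from the paper): a call is definitive only
    when the entire interval [lower, upper] lies on one side of threshold.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    if lower > threshold:
        return "real"
    if upper < threshold:
        return "artifact"
    return "uncertain"  # interval straddles the cutoff: send to manual review


# Hypothetical prediction intervals for four variants.
intervals = [(0.91, 0.99), (0.02, 0.10), (0.35, 0.70), (0.55, 0.80)]
labels = [triage(lo, hi) for lo, hi in intervals]

# Test-set counts as reported in the abstract.
TEST_TOTAL = 5587
CALLED_REAL = 1252
CALLED_ARTIFACT = 4143

# Variants without a definitive label are the remainder of the test set.
uncertain = TEST_TOTAL - CALLED_REAL - CALLED_ARTIFACT
uncertain_pct = round(100 * uncertain / TEST_TOTAL, 1)
definitive_pct = round(100 * (CALLED_REAL + CALLED_ARTIFACT) / TEST_TOTAL, 1)

print(labels)
print(uncertain, uncertain_pct, definitive_pct)
```

Under this assumed rule, widening the intervals sends more variants to manual review and narrowing them sends fewer, so the interval width acts as the knob that trades review workload against the risk of a wrong definitive call.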
