Feature-based classifiers for somatic mutation detection in tumour–normal paired sequencing data

Motivation: The study of cancer genomes now routinely involves using next-generation sequencing (NGS) technology to profile tumours for somatic single nucleotide variants (SNVs). However, few published bioinformatics methods exist for the specific purpose of identifying somatic mutations from NGS data, and existing tools are often inaccurate, yielding intolerably high false prediction rates. As such, accurately inferring somatic mutations from paired tumour/normal NGS data remains an unsolved computational challenge.

Results: We present a comparison of four standard supervised machine learning algorithms for somatic SNV prediction in tumour/normal NGS experiments: random forest, Bayesian additive regression tree, support vector machine and logistic regression. To evaluate these approaches, we constructed 106 features representing 3369 candidate somatic SNVs from 48 breast cancer genomes, originally predicted with naive methods and subsequently revalidated to establish ground truth labels. We trained the classifiers on these data (1015 true somatic mutations and 2354 non-somatic mutation positions) and conducted a rigorous evaluation using a cross-validation framework and hold-out test NGS data from both exome capture and whole genome shotgun platforms. All learning algorithms employing predictive discriminative approaches with feature selection improved predictive accuracy over standard approaches by statistically significant margins. In addition, using unsupervised clustering of the ground truth 'false positive' predictions, we identified several distinct classes and present evidence suggesting non-overlapping sources of technical artefacts, which points to important directions for future study.

Availability: Software called MutationSeq and datasets are available from http://compbio.bccrc.ca.

Contact: saparicio@bccrc.ca

Supplementary information: Supplementary data are available at Bioinformatics online.
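The cross-validation framework mentioned in the Results can be illustrated with a minimal sketch of stratified fold assignment. This is not the MutationSeq implementation: the function name `stratified_folds`, the choice of five folds and the random seed are assumptions made for illustration, and only the class counts (1015 somatic, 2354 non-somatic) come from the abstract.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each index to one of k folds, keeping class proportions similar.

    Hypothetical helper for illustration; k=5 is an assumed value, not one
    reported in the abstract.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so every fold receives an
        # even share of each class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Class counts taken from the abstract: 1 = somatic, 0 = non-somatic.
labels = [1] * 1015 + [0] * 2354
folds = stratified_folds(labels, k=5)

fold_sizes = [len(fold) for fold in folds]
fold_positives = [sum(labels[i] for i in fold) for fold in folds]

for number, (size, positives) in enumerate(zip(fold_sizes, fold_positives)):
    print(f"fold {number}: {size} positions, {positives} somatic")
```

Stratifying by label keeps the roughly 30% somatic fraction stable across folds, so accuracy estimates for each classifier are made against a comparable class balance in every held-out fold.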
