A comparative analysis of algorithms for somatic SNV detection in cancer

Motivation: With the advent of relatively affordable high-throughput technologies, DNA sequencing of cancers is now common practice in cancer research projects and will be increasingly used in clinical practice to inform diagnosis and treatment. Somatic (cancer-only) single nucleotide variants (SNVs) are the simplest class of mutation, yet their identification in DNA sequencing data is confounded by germline polymorphisms, tumour heterogeneity and sequencing and analysis errors. Four recently published algorithms for the detection of somatic SNV sites in matched cancer–normal sequencing datasets are VarScan, SomaticSniper, JointSNVMix and Strelka. In this analysis, we apply these four SNV calling algorithms to cancer–normal Illumina exome sequencing of a chronic myeloid leukaemia (CML) patient. The candidate SNV sites returned by each algorithm are filtered to remove likely false positives, then characterized and compared to investigate the strengths and weaknesses of each SNV calling algorithm.

Results: Comparing the candidate SNV sets returned by VarScan, SomaticSniper, JointSNVMix2 and Strelka revealed substantial differences with respect to the number and character of sites returned; the somatic probability scores assigned to the same sites; their susceptibility to various sources of noise; and their sensitivities to low-allelic-fraction candidates.

Availability: Data accession number SRA081939; code at http://code.google.com/p/snv-caller-review/

Contact: david.adelson@adelaide.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.
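The comparison of candidate SNV sets described under Results can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the site coordinates are invented, the function names `support_counts` and `pairwise_jaccard` are hypothetical, and each caller's output is assumed to have already been reduced to a set of (chromosome, position) pairs after filtering.

```python
from collections import Counter
from itertools import combinations

# Invented example data: filtered candidate sites per caller,
# each site represented as a (chromosome, position) tuple.
calls = {
    "VarScan": {("chr1", 100), ("chr2", 200), ("chr3", 300)},
    "SomaticSniper": {("chr1", 100), ("chr2", 200)},
    "JointSNVMix2": {("chr1", 100), ("chr4", 400)},
    "Strelka": {("chr1", 100), ("chr2", 200), ("chr4", 400)},
}


def support_counts(calls):
    """Map each site to the number of callers that reported it."""
    counts = Counter()
    for sites in calls.values():
        counts.update(sites)
    return counts


def pairwise_jaccard(calls):
    """Return the Jaccard index of the site sets for every pair of callers."""
    result = {}
    for a, b in combinations(sorted(calls), 2):
        union = calls[a] | calls[b]
        result[(a, b)] = len(calls[a] & calls[b]) / len(union) if union else 0.0
    return result


counts = support_counts(calls)
# Sites reported by every caller, and sites reported by exactly one caller.
unanimous = sorted(site for site, n in counts.items() if n == len(calls))
private = sorted(site for site, n in counts.items() if n == 1)
jaccard = pairwise_jaccard(calls)
```

Sites private to one caller are natural candidates for closer inspection, since they may reflect either a caller-specific sensitivity (for example to low-allelic-fraction variants) or a caller-specific source of noise.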
