Performance Assessment of Variant Calling Pipelines using Human Whole Exome Sequencing and Simulated data

Whole exome sequencing (WES) is a time-consuming technology for identifying clinical variants, and it demands accurate variant calling tools. Currently available tools compromise accuracy when predicting specific types of variants. It is therefore important to identify the best aligner and variant caller combinations for detecting SNVs and InDels separately. Moreover, many important aspects of InDel detection are overlooked when the performance of tools is compared; one such aspect is the detection of InDels with respect to their length in base pairs. To assess the performance of variant callers (especially for InDels) in combination with different aligners, 20 automated pipelines were developed and evaluated using the gold reference variant dataset (NA12878) from the Genome in a Bottle (GiaB) consortium for human whole exome sequencing. Additionally, exome data simulated from two human reference genome sequences (GRCh37 and GRCh38) were used to compare the performance of the pipelines. By analyzing various performance metrics, we observed that the BWA and Novoalign aligners performed better with the DeepVariant and SAMtools callers for detecting SNVs, and with DeepVariant and GATK for InDels. Altogether, DeepVariant with BWA and Novoalign performed best. Further, we showed that merging the top-performing pipelines yielded a more accurate variant call set. Collectively, this study should help investigators improve the sensitivity and accuracy of detecting specific variants.
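
The kind of comparison described above can be illustrated with a minimal Python sketch. It is not the pipelines' actual evaluation code. It assumes that each call set has already been reduced to a set of (chrom, pos, ref, alt) tuples, that the metrics of interest are precision, recall and F1 (the abstract does not name them), and that top pipelines are merged by a simple union. The length bins and all function names are hypothetical.

```python
from collections import defaultdict

# Assumption: a variant is a (chrom, pos, ref, alt) tuple with normalized alleles.

def indel_length(variant):
    """Length difference between REF and ALT in base pairs (0 for an SNV)."""
    _, _, ref, alt = variant
    return abs(len(ref) - len(alt))

def metrics(calls, truth):
    """Precision, recall and F1 of a call set against a truth set."""
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

def indel_metrics_by_length(calls, truth, bins=((1, 5), (6, 10), (11, 50))):
    """Metrics restricted to InDels whose length falls in each (low, high) bin.

    The bin boundaries are illustrative, not taken from the study.
    """
    result = {}
    for low, high in bins:
        binned_calls = {v for v in calls if low <= indel_length(v) <= high}
        binned_truth = {v for v in truth if low <= indel_length(v) <= high}
        result[(low, high)] = metrics(binned_calls, binned_truth)
    return result

def merge_call_sets(*call_sets, min_support=1):
    """Keep variants reported by at least min_support call sets.

    min_support=1 is a union; min_support=len(call_sets) is an intersection.
    """
    support = defaultdict(int)
    for call_set in call_sets:
        for variant in call_set:
            support[variant] += 1
    return {v for v, n in support.items() if n >= min_support}

# Toy data, invented for illustration only.
truth = {("1", 100, "A", "G"), ("1", 200, "AT", "A"),
         ("1", 300, "C", "CTTTTTTT"), ("2", 50, "G", "T")}
pipeline_a = {("1", 100, "A", "G"), ("1", 200, "AT", "A"), ("2", 60, "C", "A")}
pipeline_b = {("1", 100, "A", "G"), ("1", 300, "C", "CTTTTTTT"),
              ("2", 50, "G", "T")}

merged = merge_call_sets(pipeline_a, pipeline_b)
print(metrics(pipeline_a, truth))
print(metrics(merged, truth))
print(indel_metrics_by_length(pipeline_b, truth))
```

In this toy data the union recovers every truth variant but also inherits the false positive from the first call set, so a union trades precision for recall; raising `min_support` shifts the balance the other way.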
