A Bioinformatics Pipeline for Whole Exome Sequencing: Overview of the Processing and Steps from Raw Data to Downstream Analysis

[Abstract] Recent advances in Next Generation Sequencing (NGS) technologies have given impetus to the search for the causes of rare genetic disorders. Since 2005, in the aftermath of the Human Genome Project, efforts have been made to understand the rare variants underlying genetic disorders. Benchmarking a bioinformatics pipeline for whole exome sequencing (WES) has always been a challenge. In this protocol, we describe the detailed steps of a WES pipeline, from quality check to variant analysis, compare the results with publicly deposited NGS data, and survey the techniques, algorithms and software tools used at each step. We observed that variant calling on exome and whole genome datasets yields different metrics when the variant callers GATK and VarScan are compared under different parameters. Furthermore, we found that VarScan with strict parameters could recover 80-85% of high-quality GATK SNPs from NGS data, albeit with decreased sensitivity. We believe that our protocol, in the form of a pipeline, can be used by researchers interested in performing WES analysis for genetic diseases and other clinical phenotypes.
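The recovery figure quoted above is a ratio: the share of SNPs in one call set that also appear in another. The sketch below shows one way such a ratio could be computed from two sets of VCF records. It is an illustration only, not the authors' method: the function names `snp_keys` and `recovery_fraction` are hypothetical, a SNP is matched on chromosome, position, reference and alternate allele, and no quality filtering or variant normalization is applied.

```python
from typing import Iterable, Set, Tuple

# Hypothetical key type: (chromosome, position, reference allele, alternate allele)
SnpKey = Tuple[str, int, str, str]

BASES = {"A", "C", "G", "T"}


def snp_keys(vcf_lines: Iterable[str]) -> Set[SnpKey]:
    """Collect single-nucleotide variants from tab-separated VCF data lines.

    Header lines (starting with '#') and blank lines are skipped.
    Indels and symbolic alleles are ignored. Assumption: both call sets
    use the same reference genome and chromosome naming.
    """
    keys: Set[SnpKey] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3].upper()
        # The ALT column may list several comma-separated alleles.
        for alt in fields[4].upper().split(","):
            if ref in BASES and alt in BASES:
                keys.add((chrom, pos, ref, alt))
    return keys


def recovery_fraction(reference_calls: Set[SnpKey], test_calls: Set[SnpKey]) -> float:
    """Fraction of SNPs in reference_calls that are also present in test_calls."""
    if not reference_calls:
        raise ValueError("reference call set is empty")
    return len(reference_calls & test_calls) / len(reference_calls)
```

With the GATK calls as `reference_calls` and the VarScan calls as `test_calls`, a returned value between 0.80 and 0.85 would correspond to the range reported in the abstract. Production comparisons would normally normalize variants and restrict both call sets to the same target regions first.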
