Review of Current Methods, Applications, and Data Management for the Bioinformatics Analysis of Whole Exome Sequencing

The advent of next-generation sequencing technologies has accelerated the study of human diseases at the genomic, transcriptomic, and epigenetic levels. Exome sequencing, in which the coding regions of the genome are captured and sequenced at high depth, has proven to be a cost-effective method for detecting disease-causing variants and discovering gene targets. In this review, we outline the general framework of whole exome sequencing data analysis. We focus on established bioinformatics tools and applications that support five analytical steps: raw data quality assessment, preprocessing, alignment, post-processing, and variant analysis (detection, annotation, and prioritization). We evaluate the performance of open-source alignment programs and variant calling tools using simulated and benchmark datasets, and highlight the challenges posed by the lack of concordance among variant detection tools. Based on these results, we recommend combining multiple tools and resources to reduce false positives and increase the sensitivity of variant calling. In addition, we briefly discuss the current status of, and solutions for, big data management, analysis, and summarization in bioinformatics.
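The recommendation to combine multiple variant callers can be illustrated with a minimal sketch. It assumes each tool's output has already been reduced to a set of (chromosome, position, reference allele, alternate allele) tuples; the function name `consensus_calls`, the caller labels, and the example variants are hypothetical and are not taken from the review. Requiring agreement among more tools tends to lower false positives, while accepting a call from any single tool tends to raise sensitivity.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def consensus_calls(calls_by_tool: Dict[str, Set[Variant]], min_tools: int) -> Set[Variant]:
    """Return variants reported by at least `min_tools` of the callers."""
    # Each caller contributes a set, so a variant is counted once per tool.
    counts = Counter(v for calls in calls_by_tool.values() for v in calls)
    return {v for v, n in counts.items() if n >= min_tools}


# Hypothetical call sets from three unnamed callers (illustrative data only).
calls = {
    "caller_a": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")},
    "caller_b": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "caller_c": {("chr1", 100, "A", "G"), ("chr3", 75, "T", "C")},
}

union = consensus_calls(calls, min_tools=1)         # most sensitive
majority = consensus_calls(calls, min_tools=2)      # compromise
intersection = consensus_calls(calls, min_tools=3)  # most conservative
```

In practice the threshold would be chosen against a benchmark call set; the sketch only shows how the trade-off between the union and the intersection of call sets can be expressed.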
