Collaborative science in the next-generation sequencing era: a viewpoint on how to combine exome sequencing data across sites to identify novel disease susceptibility genes

The purpose of this article is to inform readers about technical challenges that we encountered when assembling exome sequencing data from the 'Simplifying Complex Exomes' (SIMPLEXO) consortium, whose mandate is the discovery of novel genes predisposing to breast and ovarian cancers. Our motivation is to share these obstacles, and our solutions to them, as a means of communicating important technical details that should be discussed early in projects involving massively parallel sequencing.
