Next-Generation Sequencing Data Analysis (Next-Generation Sequenzierung Datenanalyse)

Next-Generation Sequencing (NGS) has become one of the most important tools in human genetics. In the course of this PhD project, targeted resequencing of the coding part of the human genome (exome sequencing) was performed on more than 4,500 samples from over 80 different projects. The samples were sequenced to identify pathogenic variants and disease-associated genes in rare and common diseases. The aim of this PhD project was to investigate and develop methods and parameters for identifying such pathogenic variants and genes in large amounts of exome sequencing data. An existing analysis pipeline was extensively modified to reduce runtime, memory usage, required disk space and hands-on time, and to make it more flexible and easier to adapt and extend. Additionally, new features were implemented to allow the analysis of further aspects of the data, such as Structural Variants (SVs) and Copy Number Variations (CNVs), and to allow multiple users to analyze large projects collaboratively. The data produced during this PhD project were used to evaluate study design requirements and key quality metrics of exome sequencing data. Several programs and strategies for variant calling were benchmarked. The influence of different variant calling procedures and variant quality metrics on sensitivity and specificity was evaluated and used to draw conclusions on best-practice variant calling. Additionally, variant calling in RNA sequencing data for the detection of RNA editing is discussed. Variant callers detect on average approximately 23,000 high-quality coding variants per exome. Guidelines for filtering and selecting these variants to identify those that are disease-causing have been developed and are illustrated by examples where applicable.
