Statistical analysis of high-throughput sequencing count data

High-throughput sequencing (HTS) refers to the simultaneous sequencing of millions of DNA fragments, which can either be assembled to reconstruct a genome or aligned to an existing reference genome. The protocol can be extended to assay a wide variety of biological states of the cell, including DNA copy number, mRNA abundance and various properties of chromatin. HTS experiments allow these biological states to be quantified as read counts at genome-wide scale in a single experiment. Although the experiments are expensive and datasets often have limited sample sizes, information can be shared across thousands of genomic ranges to obtain robust models that control for technical biases. In this thesis, I present three statistical models for analyzing HTS read count data, each aimed at answering a specific biological question. First, a hidden Markov model is developed for detecting copy number variants (CNVs) in individual samples while controlling for technical artifacts, such as variation in read counts due to local GC content. Applied to a study of 248 male patients with X-linked intellectual disability, the model predicts 16 large CNVs; 10 of these were candidate disease-causing CNVs selected for testing, and all 10 were experimentally validated. The proposed software is then compared with state-of-the-art segmentation algorithms on normalized data, showing higher sensitivity while controlling the total rate of predicted CNVs. Second, parameter estimation is improved for a statistical model of differential gene expression from RNA-Seq data. The improvements use empirical Bayes priors (priors estimated from the observations of all genes) to moderate otherwise noisy estimates of dispersion and fold change for individual genes. The improved model shows increased sensitivity and more robust estimation of fold change in comparison with other differential expression software packages for RNA-Seq.
Finally, a hierarchical Bayes model is used to associate transcription factor binding with chromatin and sequence features in regions of accessible chromatin. The hierarchical model incorporates three levels of parameters: one for individual experiments, one for experiments of the same cell type and one across all cell types. The model parameters are used to generate hypotheses regarding the DNA-binding behavior of a transcription factor, the glucocorticoid receptor. In summary, this thesis describes a set of statistical methods for HTS read count data that can be used across various biological domains. The methods form a framework for robust estimation of variables and hypothesis testing.
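
The idea behind the empirical Bayes priors mentioned above can be illustrated with a minimal sketch. It assumes a simplified normal-normal model in which each gene has a noisy log fold change estimate with a known standard error, and a zero-centered normal prior whose variance is estimated from all genes by the method of moments. This is a stand-in for illustration only, not the estimator developed in the thesis; the function name `shrink_log_fold_changes` and the example numbers are hypothetical.

```python
from statistics import fmean


def shrink_log_fold_changes(estimates, std_errors):
    """Shrink per-gene log fold changes toward zero with a shared normal prior.

    Assumed model (illustrative): estimate_i ~ N(beta_i, se_i^2) and
    beta_i ~ N(0, prior_var), with prior_var estimated from all genes.
    """
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need equally long, non-empty inputs")

    # Method of moments: the observed spread of the estimates is the prior
    # variance plus the average sampling variance, so subtract the latter.
    total_spread = fmean([b * b for b in estimates])
    sampling_var = fmean([s * s for s in std_errors])
    prior_var = max(total_spread - sampling_var, 0.0)

    shrunk = []
    for b, s in zip(estimates, std_errors):
        # Posterior mean under the normal-normal model: genes with larger
        # standard errors receive a smaller weight and are shrunk more.
        weight = prior_var / (prior_var + s * s) if prior_var > 0 else 0.0
        shrunk.append(weight * b)
    return prior_var, shrunk


# Hypothetical example: two precisely measured genes and two noisy ones.
prior_var, shrunk = shrink_log_fold_changes(
    [2.0, -2.0, 1.0, -1.0], [0.5, 0.5, 1.0, 1.0]
)
print(prior_var, shrunk)
```

In this sketch the prior variance is learned from all genes together, so genes with precise estimates keep most of their fold change while genes with noisy estimates are pulled more strongly toward zero.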
