Statistical analysis of high-throughput sequencing count data

High-throughput sequencing (HTS) refers to the simultaneous sequencing of millions of fragments of DNA, which can be either assembled to reconstitute a genome or aligned to an existing reference genome. The protocol can be extended to assay a wide variety of biological states of the cell, including DNA copy number, mRNA abundance and various properties of chromatin. HTS experiments allow these biological states to be quantified as read counts at genome-wide scale in a single experiment. Although the experiments are expensive and datasets are often produced with limited sample size, information can be shared across thousands of genomic ranges to obtain robust models that control for technical biases. In this thesis, I present three statistical models for analyzing HTS read count data, each aimed at answering a specific biological question.

First, a hidden Markov model is developed for detecting copy number variants (CNVs) in individual samples while controlling for technical artifacts, such as variation in read counts due to local GC-content. Applied to a study of 248 male patients with X-linked intellectual disability, the model predicts 16 large CNVs; 10 of these were tested as candidate disease-causing CNVs, and all 10 were experimentally validated. The proposed software is then compared with state-of-the-art segmentation algorithms on normalized data, showing higher sensitivity while controlling the total rate of predicted CNVs.

Second, parameter estimation is improved for a statistical model of differential gene expression from RNA-Seq data. The improvements use empirical Bayes priors (priors estimated using the observations from all genes) to moderate otherwise noisy estimates of dispersion and fold change for individual genes. The improved model shows increased sensitivity and more robust estimation of fold change in comparison with other differential expression software packages for RNA-Seq.
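The empirical Bayes idea in the second model can be illustrated with a deliberately simplified sketch. It assumes a zero-centered normal prior on per-gene log fold changes and a normal likelihood, with the prior variance estimated from all genes by the method of moments. These modeling choices and the function name `shrink_fold_changes` are assumptions made for illustration only; they are not the estimator used in the thesis.

```python
from statistics import fmean


def shrink_fold_changes(estimates, std_errors):
    """Shrink per-gene log fold changes toward zero.

    Illustrative assumption: each estimate b_i ~ N(beta_i, s_i^2) and
    beta_i ~ N(0, prior_var), where prior_var is estimated from all genes.
    """
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(s <= 0 for s in std_errors):
        raise ValueError("standard errors must be positive")

    # The spread of the raw estimates around zero is the prior variance
    # plus the average sampling variance, so subtract the latter.
    total_var = fmean(b * b for b in estimates)
    noise_var = fmean(s * s for s in std_errors)
    prior_var = max(total_var - noise_var, 0.0)

    # Posterior mean: noisy genes (large s) are pulled harder toward zero.
    return [b * prior_var / (prior_var + s * s)
            for b, s in zip(estimates, std_errors)]
```

With two genes that have the same raw estimate but different standard errors, the noisier gene ends up with the smaller shrunken fold change, which is the moderating behavior the abstract describes.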
Finally, a hierarchical Bayes model is used to associate transcription factor binding with chromatin and sequence features in regions of accessible chromatin. The hierarchical model incorporates three levels of parameters: one for individual experiments, one for experiments of the same cell type and one across all cell types. The model parameters are used to generate hypotheses regarding the DNA-binding behavior of a transcription factor, the glucocorticoid receptor.

In summary, this thesis describes a set of statistical methods for HTS read count data which can be used across various biological domains. The methods form a framework for robust estimation of variables and hypothesis testing.

M. Love
