
Simultaneous inference for RNA-Seq data

In the last few years, RNA-Seq has become a popular choice for high-throughput studies of gene expression, showing the potential to supersede microarrays as the standard for transcriptional profiling. At the gene level, RNA-Seq yields counts rather than continuous measures of expression, which calls for novel methods for count data in high-dimensional problems. This thesis aims to shed light on the problems related to the exploration and modeling of RNA-Seq data. In particular, we introduce simple and effective ways to summarize and visualize the data, define a novel algorithm for clustering RNA-Seq data, and implement simple normalization strategies to deal with technology-related biases. Finally, we present a hierarchical Bayesian approach to modeling RNA-Seq data. The model accounts for differences in sequencing depth and for overdispersion, and automatically accommodates different types of normalization.
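To make the terms "sequencing depth" and "overdispersion" concrete, here is a minimal Python sketch. It is not the thesis's hierarchical Bayesian model or its normalization strategies. It assumes plain total-count scaling and a method-of-moments dispersion estimate under the negative binomial variance form Var = mu + phi * mu^2. The function names and the toy count matrix are hypothetical, and every sample is assumed to contain at least one read.

```python
from statistics import mean, variance


def scale_by_depth(counts):
    """Rescale each sample (column) to the mean sequencing depth.

    counts: list of per-gene rows, one count per sample.
    Assumes every sample has a nonzero total.
    """
    depths = [sum(col) for col in zip(*counts)]  # total reads per sample
    target = mean(depths)
    return [[c * target / d for c, d in zip(row, depths)] for row in counts]


def moment_dispersion(values):
    """Method-of-moments dispersion phi under Var = mu + phi * mu**2.

    Returns 0.0 when the sample variance does not exceed the mean,
    i.e. when the values look no more variable than Poisson counts.
    """
    mu = mean(values)
    if mu == 0:
        return 0.0
    return max(0.0, (variance(values) - mu) / mu ** 2)


# Hypothetical toy data: 3 genes (rows) x 4 samples (columns).
counts = [
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [5, 5, 5, 5],
]
scaled = scale_by_depth(counts)
dispersions = [moment_dispersion(row) for row in scaled]
```

After scaling, every sample has the same total, so the remaining gene-wise variability is no longer driven by depth alone. A positive `phi` for a gene indicates more variance than a Poisson model would allow, which is the overdispersion the abstract refers to.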

Davide Risso
