Modeling temporally-regulated effects on distributions

We present a nonparametric framework for modeling an evolving sequence of (estimated) probability distributions that distinguishes the effects of sequential progression on the observed distribution from extraneous sources of noise (i.e., latent variables that perturb the distributions independently of the sequence index). To discriminate between these two types of variation, our methods leverage the underlying assumption that the effects of sequential progression follow a consistent trend. Our methods are motivated by the recent rise of single-cell RNA-sequencing time course experiments, in which an important analytic goal is to identify, at cellular resolution, the genes relevant to the progression of a biological process of interest. As existing statistical tools are not suited to this task, we introduce a new regression model for (ordinal-value, univariate-distribution) covariate-response pairs in which the class of regression functions reflects coherent changes to the distributions over increasing levels of the covariate, a concept we refer to as trends in distributions. Through a simulation study and extensive application of our ideas to data from recent single-cell gene-expression time course experiments, we demonstrate numerous strengths of our framework. Finally, we characterize both the theoretical properties of the proposed estimators and the generality of our trend assumption across diverse types of underlying sequential-progression effects, thus highlighting the utility of our framework for a wide variety of other applications involving the analysis of distributions with associated ordinal labels.

Thesis Supervisor: Tommi S. Jaakkola
Title: Professor of Electrical Engineering and Computer Science

Thesis Supervisor: David K. Gifford
Title: Professor of Electrical Engineering and Computer Science

[1]  Rob J Hyndman,et al.  Estimating and Visualizing Conditional Densities , 1996 .

[2]  J. A. Cuesta-Albertos,et al.  Tests of goodness of fit based on the $L_2$-Wasserstein distance , 1999 .

[3]  M. Seto,et al.  A WNT/β-Catenin Signaling Activator, R-spondin, Plays Positive Regulatory Roles during Skeletal Myogenesis* , 2011, The Journal of Biological Chemistry.

[4]  Leo J. Th. van der Kamp,et al.  Longitudinal Data Analysis: Designs, Models and Methods , 1999 .

[5]  Alexander J. Smola,et al.  Nonparametric Quantile Estimation , 2006, J. Mach. Learn. Res..

[6]  Jianqing Fan,et al.  Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems , 1996 .

[7]  R. Sandberg,et al.  Single-Cell RNA-Seq Reveals Dynamic, Random Monoallelic Gene Expression in Mammalian Cells , 2014, Science.

[8]  Moshe Shaked,et al.  Stochastic orders and their applications , 1994 .

[9]  N. Neff,et al.  Quantitative assessment of single-cell RNA-sequencing methods , 2013, Nature Methods.

[10]  I. Amit,et al.  Massively Parallel Single-Cell RNA-Seq for Marker-Free Decomposition of Tissues into Cell Types , 2014, Science.

[11]  Elmar Wolfstetter Stochastic Dominance: Applications , 1999 .

[12]  Fabian J Theis,et al.  Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells , 2015, Nature Biotechnology.

[13]  R. Serfling Approximation Theorems of Mathematical Statistics , 1980 .

[14]  M. Apostolova,et al.  Metallothionein and apoptosis during differentiation of myoblasts to myotubes: protection against free radical toxicity. , 1999, Toxicology and applied pharmacology.

[15]  Cole Trapnell,et al.  The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells , 2014, Nature Biotechnology.

[16]  M. Basseville,et al.  Sequential Analysis: Hypothesis Testing and Changepoint Detection , 2014 .

[17]  Ryszard Zieliński Small-Sample Quantile Estimators in a Large Nonparametric Model , 2006 .

[18]  Orkun S. Soyer,et al.  The Details in the Distributions: Why and How to Study Phenotypic Variability This Review Comes from a Themed Issue on Systems Biology Experimental Methods for Studying Phenotypic Variability within Genotype Variability between Genotype Variation between Plate Technical Variation between Environment , 2022 .

[19]  René Bernards,et al.  TSPYL5 suppresses p53 levels and function by physical interaction with USP7 , 2011, Nature Cell Biology.

[20]  Gábor Lugosi,et al.  Concentration Inequalities - A Nonasymptotic Theory of Independence , 2013, Concentration Inequalities.

[21]  R. Dykstra,et al.  A Method for Finding Projections onto the Intersection of Convex Sets in Hilbert Spaces , 1986 .

[22]  Sean C. Bendall,et al.  Conditional density-based analysis of T cell signaling in single-cell data , 2014, Science.

[23]  Giulia Piaggio,et al.  P53 Regulates Myogenesis by Triggering the Differentiation Activity of Prb , 2000, The Journal of cell biology.

[24]  W. Gilchrist,et al.  Statistical Modelling with Quantile Functions , 2000 .

[25]  Rob J Hyndman,et al.  Sample Quantiles in Statistical Packages , 1996 .

[26]  P. Kharchenko,et al.  Bayesian approach to single-cell differential expression analysis , 2014, Nature Methods.

[27]  Hadley Wickham,et al.  Graphics for Statistics and Data Analysis with R , 2010 .

[28]  S. Dudoit,et al.  Normalization of RNA-seq data using factor analysis of control genes or samples , 2014, Nature Biotechnology.

[29]  Jason Chuang,et al.  RNA sequencing reveals a diverse and dynamic repertoire of the Xenopus tropicalis transcriptome over development , 2012, Genome research.

[30]  Ralf Herwig,et al.  ConsensusPathDB: toward a more complete picture of cell biology , 2010, Nucleic Acids Res..

[31]  L. Gordon,et al.  Tutorial on large deviations for the binomial distribution , 1989 .

[32]  D. Gifford,et al.  Differentiated human stem cells resemble fetal, not adult, β cells , 2014, Proceedings of the National Academy of Sciences.

[33]  Rodney C. Wolff,et al.  Methods for estimating a conditional distribution function , 1999 .

[34]  Pawel Zajac,et al.  Highly multiplexed and strand-specific single-cell RNA 5′ end sequencing , 2012, Nature Protocols.

[35]  T. Jaakkola,et al.  Comparing the continuous representation of time-series expression profiles to identify differentially expressed genes , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[36]  Tommi S. Jaakkola,et al.  Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Sample Guarantees , 2014, AISTATS.

[37]  Steven Ruggles,et al.  Integrated Public Use Microdata Series: Version 3 , 2003 .

[38]  Qiwei Yao,et al.  Approximating conditional distribution functions using dimension reduction , 2005 .

[39]  Rian,et al.  Non-crossing quantile regression curve estimation , 2010 .

[40]  A. Saliba,et al.  Single-cell RNA-seq: advances and future challenges , 2014, Nucleic acids research.

[41]  G. Smyth,et al.  Statistical Applications in Genetics and Molecular Biology Permutation P -values Should Never Be Zero: Calculating Exact P -values When Permutations Are Randomly Drawn , 2011 .

[42]  Emmanuel Saez,et al.  Top Incomes and the Great Recession: Recent Evolutions and Policy Implications , 2013 .

[43]  D. Tranchina,et al.  Stochastic mRNA Synthesis in Mammalian Cells , 2006, PLoS biology.

[44]  P. Good,et al.  Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses , 1995 .

[45]  Naomi S. Altman,et al.  Bandwidth selection for kernel distribution function estimation , 1995 .

[46]  A. Janssen,et al.  How do bootstrap and permutation tests work , 2003 .

[47]  Jan de Leeuw,et al.  Correctness of Kruskal's algorithms for monotone regression with ties , 1977 .

[48]  John E. Eriksson,et al.  Nestin as a regulator of Cdk5 in differentiating myoblasts , 2011, Molecular biology of the cell.

[49]  Michael J. Best,et al.  Active set algorithms for isotonic regression; A unifying framework , 1990, Math. Program..

[50]  James J. Chen,et al.  Kernel estimation for adjusted p , 2007, Comput. Stat. Data Anal..

[51]  M. Myers,et al.  The Zinc Transporter, Slc39a7 (Zip7) Is Implicated in Glycaemic Control in Skeletal Muscle Cells , 2013, PloS one.

[52]  Roman Vershynin,et al.  Introduction to the non-asymptotic analysis of random matrices , 2010, Compressed Sensing.

[53]  Zizhen Yao,et al.  Tissue-specific splicing of a ubiquitously expressed transcription factor is essential for muscle differentiation. , 2013, Genes & development.

[54]  Patrick L. Combettes,et al.  Proximal Splitting Methods in Signal Processing , 2009, Fixed-Point Algorithms for Inverse Problems in Science and Engineering.

[55]  S. Linnarsson,et al.  Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq , 2015, Science.

[56]  C. Kruse,et al.  The role of α-smooth muscle actin in myogenic differentiation of human glandular stem cells and their potential for smooth muscle cell replacement therapies , 2010, Expert opinion on biological therapy.