Flexible models for understanding and optimizing complex populations

Data analysis is often driven by the goals of understanding or optimizing some population of interest. The first of these two objectives aims to produce insights regarding characteristics of the underlying population, often to facilitate scientific understanding. Crucially, this requires models which produce results that are highly interpretable to the analyst. On the other hand, notions of interpretability are not necessarily as central for determining how to optimize populations, where the aim is to build data-driven systems which learn how to act upon individuals in a manner that maximally improves certain outcomes of interest across the population. In this thesis, we develop interpretable yet flexible modeling frameworks for addressing the former goal, as well as black-box nonparametric methods for addressing the latter. Throughout, we demonstrate various empirical applications of our algorithms, primarily in the biological context of modeling gene expression in large cell populations.

To better understand populations, we introduce two nonparametric models that can accurately reflect interesting characteristics of complex distributions without reliance on restrictive assumptions, while simultaneously remaining highly interpretable through their use of the Wasserstein (optimal transport) metric to summarize changes over an entire population. One approach is principal differences analysis, a projection-based technique that interpretably characterizes differences between two arbitrary high-dimensional probability distributions. Another approach is the TRENDS model, which quantifies the underlying effects of temporal progression in an evolving sequence of distributions that also vary due to confounding noise.

While the aforementioned techniques fall under the frequentist regime, we subsequently present a Bayesian framework for the task of optimizing populations.
Drawing upon the Gaussian process toolkit, our method learns how to best conservatively intervene upon heterogeneous populations in settings with limited data and substantial uncertainty about the underlying relationship between actions and outcomes.

Thesis Supervisor: Tommi Jaakkola, Professor of EECS
Thesis Supervisor: David Gifford, Professor of EECS
