Lineage-based identification of cellular states and expression programs

Summary: We present a method, LineageProgram, that uses the developmental lineage relationship of observed gene expression measurements to improve the learning of developmentally relevant cellular states and expression programs. We find that incorporating lineage information significantly improves both the predictive power and the interpretability of expression programs derived from expression measurements in in vitro differentiation experiments. The lineage tree of a differentiation experiment is a tree whose nodes describe all of the unique expression states in the input expression measurements and whose edges describe the experimental perturbations applied to cells. LineageProgram is based on a log-linear model with parameters that reflect changes along the lineage tree. L1-based regularization controls the parameters in three distinct ways: the number of genes that change between two cellular states, the number of unique cellular states, and the number of underlying factors responsible for changes in cell state. The model is estimated with proximal operators to quickly discover a small number of key cell states and gene sets. Comparisons with existing factorization techniques, such as singular value decomposition and non-negative matrix factorization, show that our method provides higher predictive power in held-out tests while inducing sparse and biologically relevant gene sets. Contact: gifford@mit.edu
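The role of the L1 penalty and the proximal operator can be illustrated with a minimal sketch. It is not the LineageProgram implementation: it assumes a squared-error loss in place of the log-linear likelihood, covers only the first of the three penalties (sparse per-edge gene changes), and all names (`soft_threshold`, `path_matrix`, `fit_edge_deltas`, the toy tree and data) are hypothetical. Each node's expression is modeled as the sum of the changes on the edges from the root to that node, and proximal gradient descent (ISTA) shrinks most changes toward zero.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1: elementwise shrinkage toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def path_matrix(parent):
    """A[i, j] = 1 if node j lies on the path from the root to node i.

    Column j stands for the change introduced on the edge into node j;
    the root's column holds the baseline expression.
    """
    n = len(parent)
    A = np.zeros((n, n))
    for i in range(n):
        j = i
        while j != -1:
            A[i, j] = 1.0
            j = parent[j]
    return A

def fit_edge_deltas(Y, parent, lam=0.1, n_iter=500):
    """ISTA for: minimize 0.5 * ||Y - A D||_F^2 + lam * ||D||_1.

    Assumption: squared error stands in for the log-linear likelihood.
    """
    A = path_matrix(parent)
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    D = np.zeros_like(Y, dtype=float)
    for _ in range(n_iter):
        grad = A.T @ (A @ D - Y)                          # gradient of the smooth loss
        D = soft_threshold(D - step * grad, step * lam)   # proximal (shrinkage) step
    return D

# Toy lineage (hypothetical): node 0 is the root, nodes 1 and 2 are its
# children, and node 3 is a child of node 1. Rows are nodes, columns are genes.
parent = [-1, 0, 0, 1]
Y = np.array([[1.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 1.0],
              [1.0, 3.0, 5.0]])
D = fit_edge_deltas(Y, parent, lam=0.05)
print(np.round(D, 3))
```

In this toy example node 2 has the same expression as the root, so the changes on its edge are shrunk to approximately zero, which is the sense in which a sparse penalty on edge changes can merge redundant cell states. The other two penalties described in the summary would need additional terms that this sketch does not attempt.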
