Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings

Recent studies compare gene expression data across species to identify core and species-specific genes in biological systems. To perform such comparisons, researchers need to match genes across species. This is a challenging task because the correct matches (orthologs) are not known for most genes. Previous work in this area used deterministic matchings or reduced multidimensional expression data to a binary representation. Here we develop a new method that can use soft matches (given as priors) to infer both the unique and similar expression patterns across species and a matching for the genes in both species. Our method uses a Dirichlet process mixture model that includes a latent data matching variable. We present learning and inference algorithms for this model based on variational methods. Applying our method to immune response data, we show that it can accurately identify common and unique response patterns by improving the matchings between human and mouse genes.
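
As a rough illustration of the two ingredients named in the abstract, the sketch below draws mixture weights from a truncated stick-breaking construction of a Dirichlet process and samples a latent match for one gene from a soft-match prior. It is not the variational algorithm of the paper: the function names, the truncation level, the concentration value, and the example prior are all assumptions made for illustration only.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a truncated stick-breaking construction.

    alpha is the concentration parameter; truncation is the number of
    components kept (an assumption of this sketch, not a value from the paper).
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for k in range(truncation):
        # The last component takes whatever is left so the weights sum to one.
        frac = 1.0 if k == truncation - 1 else rng.betavariate(1.0, alpha)
        weights.append(frac * remaining)
        remaining *= 1.0 - frac
    return weights


def sample_match(prior, rng):
    """Sample the index of a candidate ortholog from a soft-match prior."""
    return rng.choices(range(len(prior)), weights=prior, k=1)[0]


rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, truncation=10, rng=rng)

# Hypothetical prior over three candidate mouse genes for one human gene.
match_prior = [0.7, 0.2, 0.1]
match = sample_match(match_prior, rng)
```

In a full model, the sampled match would determine which pair of expression profiles is treated as one joint observation when it is assigned to a mixture component; here the two pieces are shown separately to keep the sketch short.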
