Identifying pathogenic processes by integrating microarray data with prior

Background: Identifying the molecular processes and pathways involved in disease etiology is of great importance. Although various high-throughput methods have been used extensively for this task, pathogenic pathways are still not completely understood. Often the set of genes or proteins identified as altered in genome-wide screens shows poor overlap with canonical disease pathways. These findings are difficult to interpret, yet they are crucial for improving the understanding of the molecular processes underlying disease progression. We present a novel method for identifying groups of connected molecules from a set of differentially expressed genes. These groups represent functional modules that share a common cellular function and involve signaling and regulatory events. Specifically, our method uses Bayesian statistics to identify groups of co-regulated genes based on the microarray data, where external information about molecular interactions and connections is used as a prior in the group assignments. Markov chain Monte Carlo sampling is used to search for the most reliable grouping.

Results: Simulation results showed that the method improved the ability to identify correct groups compared to traditional clustering, especially for small sample sizes. Applied to a microarray heart failure dataset, the method found one large cluster with several genes important for the structure of the extracellular matrix and a smaller group with many genes involved in carbohydrate metabolism. The method was also applied to a microarray dataset on melanoma patients with or without metastasis, where the main cluster was dominated by genes related to keratinocyte differentiation.

Conclusion: Our method found clusters overlapping with known pathogenic processes, but also pointed to new connections extending beyond the classical pathways.
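The general idea summarized in the abstract, clustering expression profiles while letting known molecular interactions act as a prior on the group assignments and searching the assignments by Markov chain Monte Carlo, can be illustrated with a small sketch. This is not the authors' model: the Gaussian likelihood with plug-in group means, the edge-counting network prior, the fixed number of groups, and every name and parameter below (`log_posterior`, `sample_groups`, `coupling`, `sigma`, the toy data) are assumptions made purely for illustration.

```python
import math
import random


def log_posterior(labels, data, edges, n_groups, coupling=1.0, sigma=1.0):
    """Unnormalized log posterior of a grouping (illustrative assumption).

    Likelihood: each gene's profile is Gaussian around its group's mean profile
    (the mean is plugged in rather than integrated out, a simplification).
    Prior: every known interaction (edge) joining two genes in the same group
    adds `coupling` to the log prior.
    """
    n_samples = len(data[0])
    score = 0.0
    for g in range(n_groups):
        members = [i for i, lab in enumerate(labels) if lab == g]
        if not members:
            continue
        for s in range(n_samples):
            mean = sum(data[i][s] for i in members) / len(members)
            score -= sum((data[i][s] - mean) ** 2 for i in members) / (2 * sigma ** 2)
    for i, j in edges:
        if labels[i] == labels[j]:
            score += coupling
    return score


def sample_groups(data, edges, n_groups, n_iter=2000, coupling=1.0, sigma=1.0, seed=0):
    """Metropolis sampler over group labels; returns the best grouping visited."""
    rng = random.Random(seed)
    n_genes = len(data)
    labels = [rng.randrange(n_groups) for _ in range(n_genes)]
    current = log_posterior(labels, data, edges, n_groups, coupling, sigma)
    best_labels, best_score = list(labels), current
    for _ in range(n_iter):
        i = rng.randrange(n_genes)
        old = labels[i]
        labels[i] = rng.randrange(n_groups)  # symmetric proposal: relabel one gene
        proposed = log_posterior(labels, data, edges, n_groups, coupling, sigma)
        # Metropolis rule: always accept uphill moves, downhill with prob exp(diff)
        if proposed >= current or rng.random() < math.exp(proposed - current):
            current = proposed
            if current > best_score:
                best_labels, best_score = list(labels), current
        else:
            labels[i] = old  # reject: restore the previous label
    return best_labels


# Hypothetical toy input: six genes, four samples, two co-expressed sets.
toy_data = [
    [0.1, -0.2, 0.0, 0.3],
    [0.0, 0.1, -0.1, 0.2],
    [-0.1, 0.0, 0.2, 0.1],
    [5.1, 4.9, 5.0, 5.2],
    [4.8, 5.0, 5.1, 4.9],
    [5.0, 5.2, 4.9, 5.0],
]
# Hypothetical prior knowledge: known interactions between gene indices.
toy_edges = [(0, 1), (1, 2), (3, 4), (4, 5)]

best = sample_groups(toy_data, toy_edges, n_groups=2)
print(best)
```

In this sketch the `coupling` parameter controls how strongly the interaction network pulls linked genes into the same group. With few samples the expression likelihood is weak and the prior matters more, which is consistent in spirit with the abstract's observation that the benefit was largest for small sample sizes.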
