A hierarchical mixture of Markov models for finding biologically active metabolic paths using gene expression and protein classes

With the recent development of high-throughput experimental techniques, the variety and volume of biological data have grown dramatically in recent years. Mining several types of data together may yield new biological insights. We present a new methodology for systematically combining three different datasets to find biologically active metabolic paths/patterns. The method consists of two steps. First, it efficiently synthesizes metabolic paths from a given set of known chemical reactions whose enzymes are co-expressed. Second, it summarizes the synthesized paths in a more comprehensible form by using them to estimate the parameters of a probabilistic model. The model assumes that the entire set of chemical reactions corresponds to a Markov state transition diagram. It is also a hierarchical latent variable model in which a set of protein classes serves as a latent variable, so that input paths are clustered according to existing knowledge of protein classes. We tested our method on the main pathway of glycolysis and found that it achieved higher predictive performance in classifying gene expression data than other unsupervised methods. We further analyzed the estimated parameters of our probabilistic models and found that, for each type of expression experiment, biologically active paths were clustered into only two or three patterns, each of which suggested new long-range relations in the glycolysis pathway.
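As a rough illustration of the second step, the following minimal sketch fits a plain mixture of first-order Markov chains to a set of paths with EM. It is a simplification under stated assumptions, not the model described above: the latent variable here is an unlabeled cluster index rather than a set of protein classes, there is no hierarchy, and paths are encoded as lists of integer state indices. All names (`fit_markov_mixture`, `m_step`, `e_step`, the smoothing constant `alpha`) and the toy paths are hypothetical.

```python
import math
import random


def _normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def m_step(paths, resp, n_states, n_clusters, alpha):
    """Re-estimate weights, initial and transition probabilities.

    Counts are weighted by the responsibilities and smoothed with the
    pseudo-count alpha (an assumption of this sketch).
    """
    weights = [alpha] * n_clusters
    inits = [[alpha] * n_states for _ in range(n_clusters)]
    trans = [[[alpha] * n_states for _ in range(n_states)]
             for _ in range(n_clusters)]
    for path, r in zip(paths, resp):
        for k in range(n_clusters):
            weights[k] += r[k]
            inits[k][path[0]] += r[k]
            for a, b in zip(path, path[1:]):
                trans[k][a][b] += r[k]
    weights = _normalize(weights)
    inits = [_normalize(row) for row in inits]
    trans = [[_normalize(row) for row in mat] for mat in trans]
    return weights, inits, trans


def e_step(paths, weights, inits, trans):
    """Compute the posterior probability of each cluster for each path."""
    resp = []
    for path in paths:
        logp = []
        for w, init, mat in zip(weights, inits, trans):
            lp = math.log(w) + math.log(init[path[0]])
            lp += sum(math.log(mat[a][b]) for a, b in zip(path, path[1:]))
            logp.append(lp)
        top = max(logp)  # subtract the maximum for numerical stability
        resp.append(_normalize([math.exp(lp - top) for lp in logp]))
    return resp


def fit_markov_mixture(paths, n_states, n_clusters,
                       n_iter=50, alpha=0.01, seed=0):
    """Run EM from a random soft assignment of paths to clusters."""
    rng = random.Random(seed)
    resp = [_normalize([rng.random() + 1e-3 for _ in range(n_clusters)])
            for _ in paths]
    for _ in range(n_iter):
        weights, inits, trans = m_step(paths, resp, n_states,
                                       n_clusters, alpha)
        resp = e_step(paths, weights, inits, trans)
    return weights, inits, trans, resp


# Toy input: two families of paths over four hypothetical reaction states.
toy_paths = [[0, 1, 2, 3]] * 5 + [[3, 2, 1, 0]] * 5
weights, inits, trans, resp = fit_markov_mixture(toy_paths, 4, 2)
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
```

In this sketch, each row of `resp` gives the posterior cluster membership of one path, and each cluster's transition matrix in `trans` can be read as one "pattern" of state transitions. Replacing the free cluster index with known protein classes, as the abstract describes, would require additional structure that is not shown here.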
