A machine learning approach to query time-series microarray data sets for functionally related genes using hidden markov models

Microarray technology captures the rate of expression of genes under varying experimental conditions. Genes encode the information necessary to build proteins; proteins used by cellular functions exhibit higher rates of expression for the associated genes. If multiple proteins are required for a particular function then their genes show a pattern of coexpression during time periods when the function is active within a cell. Cellular functions are generally complex and require groups of genes to cooperate; these groups of genes are called functional modules. Modular organization of genetic functions has been evident since 1999. Detecting functionally related genes in a genome and detecting all genes belonging to particular functional modules are current research topics in this field. The number of microarray gene expression datasets available in public repositories increases rapidly, and advances in technology have now made it feasible to routinely perform whole-genome studies where the behavior of every gene in a genome is captured. This promises a wealth of biological and medical information, but making the amount of data accessible to researchers requires intelligent and efficient computational algorithms. Researchers working on specific cellular functions would benefit from this data if it was possible to quickly extract information useful to their area of research. This dissertation develops a machine learning algorithm that allows one or multiple microarray data sets to be queried with a set of known and functionally related input genes in order to detect additional genes participating in the same or closely related functions. The focus is on time-series microarray datasets where gene expression values are obtained from the same experiment over a period of time from a series of sequential measurements. A feature selection algorithm selects relevant time steps where the provided input genes exhibit correlated expression behavior. Time steps are the columns in microarray data sets, rows list individual genes. A specific linear Hidden Markov Model (HMM) is then constructed to contain one hidden state for each of the selected experiments and is trained using the expression values of the input genes from the microarray. Given the trained HMM the probability that a sequence of gene expression values was generated by that particular HMM can be calculated. This allows for the assignment of a probability score for each gene in the microarray. High-scoring genes are included in the result set (of genes with functional similarities to the input genes.) P-values can be calculated by repeating this algorithm to train multiple individual HMMs using randomly selected genes as input genes and calculating a Parzen Density Function (PDF) from the probability scores of all HMMs for each gene. A feedback loop uses the result generated from one algorithm run as input set for another iteration of the algorithm. This iterated HMM algorithm allows for the characterization of functional modules from very small input sets and for weak similarity signals. This algorithm also allows for the integration of multiple microarray data sets; two approaches are studied: Meta-Analysis (combination of the results from individual data set runs) and the extension of the linear HMM across multiple individual data sets. Results indicate that Meta-Analysis works best for integration of closely related microarrays and a spanning HMM works best for the integration of multiple heterogeneous datasets. The performance of this approach is demonstrated relative to the published literature on a number of widely used synthetic data sets. Biological application is verified by analyzing biological data sets of the Fruit Fly D. Melanogaster and Baker.s Yeast S. Cerevisiae. The algorithm developed in this dissertation is better able to detect functionally related genes in common data sets than currently available algorithms in the published literature.

[1]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[2]  Philip S. Yu,et al.  A graph-based approach to systematically reconstruct human transcriptional regulatory modules , 2007, ISMB/ECCB.

[3]  Dennis B. Troup,et al.  NCBI GEO: archive for high-throughput functional genomic data , 2008, Nucleic Acids Res..

[4]  Sangsoo Kim,et al.  Combining multiple microarray studies and modeling interstudy variation , 2003, ISMB.

[5]  E. Salmon Gene Expression During the Life Cycle of Drosophila melanogaster , 2002 .

[6]  Gerhard G. Thallinger,et al.  ArrayNorm: comprehensive normalization and analysis of microarray data , 2004, Bioinform..

[7]  Satoru Miyano,et al.  Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models , 2008, Bioinform..

[8]  Roded Sharan,et al.  Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[9]  Michael R. Thon,et al.  Identifying clusters of functionally related genes in genomes , 2007, Bioinform..

[10]  Jean YH Yang,et al.  Bioconductor: open software development for computational biology and bioinformatics , 2004, Genome Biology.

[11]  B. S. Baker,et al.  Gene Expression During the Life Cycle of Drosophila melanogaster , 2002, Science.

[12]  Blatt,et al.  Superparamagnetic clustering of data. , 1998, Physical review letters.

[13]  D. Botstein,et al.  Singular value decomposition for genome-wide expression data processing and modeling. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[14]  K Karplus,et al.  Predicting protein structure using only sequence information , 1999, Proteins.

[15]  B. Snel,et al.  The identification of functional modules from the genomic association of genes , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[16]  Sven Bergmann,et al.  Defining transcription modules using large-scale gene expression data , 2004, Bioinform..

[17]  Jun Zhu,et al.  Identifying differentially expressed genes in human acute leukemia and mouse brain microarray datasets utilizing QTModel , 2009, Functional & Integrative Genomics.

[18]  G. Church,et al.  Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae , 2001, Nature Genetics.

[19]  E. Parzen On Estimation of a Probability Density Function and Mode , 1962 .

[20]  Russ B. Altman,et al.  Missing value estimation methods for DNA microarrays , 2001, Bioinform..

[21]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[22]  Martin Vingron,et al.  Improved detection of overrepresentation of Gene-Ontology annotations with parent-child analysis , 2007, Bioinform..

[23]  D. di Bernardo,et al.  How to infer gene networks from expression profiles , 2007, Molecular systems biology.

[24]  D Haussler,et al.  Knowledge-based analysis of microarray gene expression data by using support vector machines. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[25]  Christian Pilarsky,et al.  Meta-analysis of microarray data on pancreatic cancer defines a set of commonly dysregulated genes , 2005, Oncogene.

[26]  G. Getz,et al.  Coupled two-way clustering analysis of gene microarray data. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[27]  H. Kitano Systems Biology: A Brief Overview , 2002, Science.

[28]  Werner Paulus,et al.  Drosophila melanogaster as a Model Organism of Brain Diseases , 2009, International journal of molecular sciences.

[29]  Yaniv Ziv,et al.  Revealing modular organization in the yeast transcriptional network , 2002, Nature Genetics.

[30]  Korbinian Strimmer,et al.  An empirical Bayes approach to inferring large-scale gene association networks , 2005, Bioinform..

[31]  Ida Scheel,et al.  The influence of missing value imputation on detection of differentially expressed genes from microarray data , 2005, Bioinform..

[32]  Helen E. Parkinson,et al.  ArrayExpress—a public database of microarray experiments and gene expression profiles , 2006, Nucleic Acids Res..

[33]  Ole Winther,et al.  Robust multi-scale clustering of large DNA microarray datasets with the consensus algorithm , 2006, Bioinform..

[34]  Sorin Istrail,et al.  Logic Functions of the Genomic Cis-regulatory Code , 2005, UC.

[35]  C. Pipper,et al.  [''R"--project for statistical computing]. , 2008, Ugeskrift for laeger.

[36]  Taesung Park,et al.  Statistical tests for identifying differentially expressed genes in time-course microarray experiments , 2003, Bioinform..

[37]  Brett A. McKinney,et al.  Real-world comparison of CPU and GPU implementations of SNPrank: a network analysis tool for GWAS , 2011, Bioinform..

[38]  Nicolas E. Buchler,et al.  On schemes of combinatorial transcription logic , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[39]  Timothy B. Stockwell,et al.  The Sequence of the Human Genome , 2001, Science.

[40]  Rainer Breitling,et al.  A comparison of meta-analysis methods for detecting differentially expressed genes in microarray experiments , 2008, Bioinform..

[41]  Hongyu Zhao,et al.  Pathway analysis using random forests classification and regression , 2006, Bioinform..

[42]  Julie M. Sullivan,et al.  FlyMine: an integrated database for Drosophila and Anopheles genomics , 2007, Genome Biology.

[43]  Ethan Bier,et al.  Drosophila, the golden bug, emerges as a tool for human genetics , 2005, Nature Reviews Genetics.

[44]  J. Gern The Sequence of the Human Genome , 2001, Science.

[45]  S. Squazzo,et al.  A computational genomics approach to identify cis-regulatory modules from chromatin immunoprecipitation microarray data--a case study using E2F1. , 2006, Genome research.

[46]  J. Blake,et al.  The Gene Ontology (GO) Project: Structured Vocabularies for Molecular Biology and Their Application to Genome and Expression Analysis , 2008, Current protocols in bioinformatics.

[47]  A. Barabasi,et al.  Hierarchical Organization of Modularity in Metabolic Networks , 2002, Science.

[48]  Xue-wen Chen,et al.  Identification of genes involved in the same pathways using a Hidden Markov Model-based approach , 2009, Bioinform..

[49]  J. Mesirov,et al.  Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[50]  Albert-László Barabási,et al.  Hierarchical organization in complex networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[51]  See-Kiong Ng,et al.  On combining multiple microarray studies for improved functional classification by whole-dataset feature selection. , 2003, Genome informatics. International Conference on Genome Informatics.

[52]  A. Barabasi,et al.  Network biology: understanding the cell's functional organization , 2004, Nature Reviews Genetics.

[53]  Weiguo Liu,et al.  A Parallel Algorithm for Error Correction in High-Throughput Short-Read Data on CUDA-Enabled Graphics Hardware , 2010, J. Comput. Biol..

[54]  Pieter Rein ten Wolde,et al.  Transcriptional Regulation by Competing Transcription Factor Modules , 2006, PLoS Comput. Biol..

[55]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[56]  中尾 光輝,et al.  KEGG(Kyoto Encyclopedia of Genes and Genomes)〔和文〕 (特集 ゲノム医学の現在と未来--基礎と臨床) -- (データベース) , 2000 .

[57]  Erik L. L. Sonnhammer,et al.  Improved profile HMM performance by assessment of critical algorithmic features in SAM and HMMER , 2005, BMC Bioinformatics.

[58]  Kathleen Marchal,et al.  Query-driven module discovery in microarray data , 2007, Bioinform..

[59]  J. Hopfield,et al.  From molecular to modular cell biology , 1999, Nature.

[60]  Martin Vingron,et al.  Ontologizer 2.0 - a multifunctional tool for GO term enrichment analysis and data exploration , 2008, Bioinform..

[61]  P. Kemmeren,et al.  Protein interaction verification and functional annotation by integrated analysis of genome-scale data. , 2002, Molecular cell.

[62]  Eduardo Sontag,et al.  Untangling the wires: A strategy to trace functional interactions in signaling and gene networks , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[63]  L. Mirny,et al.  Protein complexes and functional modules in molecular networks , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[64]  Tobias Müller,et al.  Identifying functional modules in protein–protein interaction networks: an integrated exact approach , 2008, ISMB.

[65]  M Gribskov,et al.  A systematic analysis of human disease-associated gene sequences in Drosophila melanogaster. , 2001, Genome research.

[66]  C. Ball,et al.  Genetic and physical maps of Saccharomyces cerevisiae. , 1997, Nature.

[67]  Simon Kasif,et al.  GEMS: a web server for biclustering analysis of expression data , 2005, Nucleic Acids Res..

[68]  M. Gerstein,et al.  Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons. , 2002, Genome research.

[69]  A. Owen,et al.  A gene recommender algorithm to identify coexpressed genes in C. elegans. , 2003, Genome research.

[70]  A. Clark,et al.  Evolutionary constraint and adaptation in the metabolic network of Drosophila. , 2008, Molecular biology and evolution.

[71]  Cliburn Chan,et al.  Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures , 2010, Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America.

[72]  Emden R. Gansner,et al.  An open graph visualization system and its applications to software engineering , 2000 .

[73]  Jonathan D. Wren,et al.  A global meta-analysis of microarray expression data to predict unknown gene functions and estimate the literature-data divide , 2009, Bioinform..

[74]  Qingxia Chen,et al.  Missing covariate data in medical research: to impute is better than to ignore. , 2010, Journal of clinical epidemiology.

[75]  S. L. Wong,et al.  Combining biological networks to predict genetic interactions. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[76]  Homin K. Lee,et al.  Coexpression analysis of human genes across many microarray data sets. , 2004, Genome research.

[77]  Alex E. Lash,et al.  Gene Expression Omnibus: NCBI gene expression and hybridization array data repository , 2002, Nucleic Acids Res..

[78]  Ying Xu,et al.  QUBIC: a qualitative biclustering algorithm for analyses of gene expression data , 2009, Nucleic acids research.

[79]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[80]  T. Kohonen Adaptive, associative, and self-organizing functions in neural computing. , 1987, Applied optics.

[81]  L. Baum,et al.  A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains , 1970 .

[82]  Huai Li,et al.  Unraveling transcriptional regulatory programs by integrative analysis of microarray and transcription factor binding data , 2008, Bioinform..

[83]  Sven Bergmann,et al.  Iterative signature algorithm for the analysis of large-scale gene expression data. , 2002, Physical review. E, Statistical, nonlinear, and soft matter physics.

[84]  I-Min A. Chen,et al.  The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata , 2007, Nucleic Acids Res..

[85]  G. Church,et al.  Systematic determination of genetic network architecture , 1999, Nature Genetics.

[86]  S. Henikoff,et al.  Automated construction and graphical presentation of protein blocks from unaligned sequences. , 1995, Gene.

[87]  Jesper Tegnér,et al.  Growing Bayesian network models of gene networks from seed genes , 2005, ECCB/JBI.

[88]  Giorgio Valle,et al.  CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment , 2008, BMC Bioinformatics.

[89]  Ying Xu,et al.  Prediction of functional modules based on comparative genome analysis and Gene Ontology application , 2005, Nucleic acids research.

[90]  Michael Ruogu Zhang,et al.  Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. , 1998, Molecular biology of the cell.

[91]  Martin A. Nowak,et al.  Inferring Cellular Networks Using Probabilistic Graphical Models , 2004 .

[92]  M. Gerstein,et al.  Relating whole-genome expression data with protein-protein interactions. , 2002, Genome research.

[93]  Nicola J. Rinaldi,et al.  Computational discovery of gene modules and regulatory networks , 2003, Nature Biotechnology.

[94]  G. Church,et al.  Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. , 2000, Journal of molecular biology.

[95]  George M Church,et al.  A network of transcriptionally coordinated functional modules in Saccharomyces cerevisiae. , 2005, Genome research.

[96]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[97]  H. Mewes,et al.  Functional modules by relating protein interaction networks and gene expression. , 2003, Nucleic acids research.

[98]  Mike Tyers,et al.  BioGRID: a general repository for interaction datasets , 2005, Nucleic Acids Res..

[99]  Olga G. Troyanskaya,et al.  A scalable method for integration and functional analysis of multiple microarray datasets , 2006, Bioinform..

[100]  Lothar Thiele,et al.  A systematic comparison and evaluation of biclustering methods for gene expression data , 2006, Bioinform..

[101]  E. Davidson,et al.  Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. , 1998, Science.

[102]  Kevin R. Coombes,et al.  Identifying Differentially Expressed Genes in cDNA Microarray Experiments , 2001, J. Comput. Biol..

[103]  Anton J. Enright,et al.  Detection of functional modules from protein interaction networks , 2003, Proteins.

[104]  Ibrahim Emam,et al.  ArrayExpress update—from an archive of functional genomics experiments to the atlas of gene expression , 2008, Nucleic Acids Res..

[105]  Joaquín Dopazo,et al.  Combining hierarchical clustering and self-organizing maps for exploratory analysis of gene expression patterns. , 2002, Journal of proteome research.

[106]  Cole Trapnell,et al.  Optimizing data intensive GPGPU computations for DNA sequence alignment , 2009, Parallel Comput..