Towards knowledge-based gene expression data mining

The field of gene expression data analysis has grown in the past few years from being purely data-centric to being integrative, aiming to complement microarray analysis with data and knowledge from diverse available sources. In this review, we report on the wide range of gene expression data mining techniques and focus on their evolution toward knowledge-based data analysis approaches. In particular, we discuss recent developments in gene expression-based analysis methods used in association and classification studies, phenotyping, and reverse engineering of gene networks.
