Gene expression modular analysis: an overview from the data mining perspective

In this review, we discuss the main problems in gene expression data analysis and the state-of-the-art solutions applied to them. Specific data analysis workflows have been developed in parallel with the technology and now cover a very wide spectrum of methods and applications, needed to answer the many scientific questions that this type of data raises. Computer science and, more specifically, data mining continue to benefit from a large set of real-case scenarios in which to apply and develop new ideas and tools for discovering biological knowledge and new information from these experimental data. In this article, we make the reader aware of the main problems that still persist and describe the methodologies applied to the classification, clustering, and functional exploration of gene expression data. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1: 381–396. DOI: 10.1002/widm.29


[28]  James C. Bezdek,et al.  Pattern Recognition with Fuzzy Objective Function Algorithms , 1981, Advanced Applications in Pattern Recognition.

[29]  Vincent F. Melfi,et al.  Microarray analysis of gene expression: considerations in data mining and statistical treatment. , 2006, Physiological genomics.

[30]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[31]  Francisco Tirado,et al.  GeneCodis: interpreting gene lists through enrichment analysis and integration of diverse biological information , 2009, Nucleic Acids Res..

[32]  Pablo Tamayo,et al.  Metagenes and molecular pattern discovery using matrix factorization , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[33]  Pablo Tamayo,et al.  Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[34]  Andrzej Cichocki,et al.  Nonnegative Matrix and Tensor Factorization T , 2007 .

[35]  Giorgio Valle,et al.  The Gene Ontology in 2010: extensions and refinements , 2009, Nucleic Acids Res..

[36]  Musa H. Asyali,et al.  Gene expression profile class prediction using linear Bayesian classifiers , 2007, Comput. Biol. Medicine.

[37]  Nadia Bolshakova,et al.  Estimating the Number of Clusters in DNA Microarray Data , 2006, Methods of Information in Medicine.

[38]  Laurie J. Heyer,et al.  Exploring expression data: identification and analysis of coexpressed genes. , 1999, Genome research.

[39]  M. Eisen,et al.  Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering , 2002, Genome Biology.

[40]  M. Braga,et al.  Exploratory Data Analysis , 2018, Encyclopedia of Social Network Analysis and Mining. 2nd Ed..

[41]  Chad Creighton,et al.  Mining gene expression databases for association rules , 2003, Bioinform..

[42]  John Quackenbush Microarray data normalization and transformation , 2002, Nature Genetics.

[43]  Jin Hwan Do,et al.  Clustering approaches to identifying gene expression patterns from DNA microarray data. , 2008, Molecules and cells.

[44]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[45]  S. Amari,et al.  Nonnegative Matrix and Tensor Factorization [Lecture Notes] , 2008, IEEE Signal Processing Magazine.

[46]  Nello Cristianini,et al.  Support vector machine classification and validation of cancer tissue samples using microarray expression data , 2000, Bioinform..

[47]  Russ B. Altman,et al.  Missing value estimation methods for DNA microarrays , 2001, Bioinform..

[48]  Andrzej Cichocki,et al.  Fast Nonnegative Matrix Factorization Algorithms Using Projected Gradient Approaches for Large-Scale Problems , 2008, Comput. Intell. Neurosci..

[49]  Jan J. Gerbrands,et al.  On the relationships between SVD, KLT and PCA , 1981, Pattern Recognit..

[50]  Ernst Wit,et al.  Statistics for microarrays , 2004 .

[51]  M. Erlander,et al.  Molecular classification of unknown primary cancer. , 2009, Seminars in oncology.

[52]  Martin Vingron,et al.  Normalization and quantification of differential expression in gene expression microarrays , 2006, Briefings Bioinform..

[53]  Antoine M. van Oijen,et al.  Real-time single-molecule observation of rolling-circle DNA replication , 2009, Nucleic acids research.

[54]  Andrzej Cichocki,et al.  Advances in Nonnegative Matrix and Tensor Factorization , 2008, Comput. Intell. Neurosci..

[55]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[56]  Ahmed H. Tewfik,et al.  Early detection of ovarian cancer using group biomarkers , 2008, Molecular Cancer Therapeutics.

[57]  A. Malpertuy,et al.  Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments , 2010, BMC Genomics.