Integration of cluster ensemble and text summarization for gene expression analysis

Two important goals of clustering-based gene expression analysis are generating high-quality gene clusters and identifying the biological mechanism underlying each cluster. To obtain high-quality clusters, most current approaches rely on choosing the single best clustering algorithm, one whose design biases and assumptions match the underlying distribution of the data set. This approach has two problems: (1) the underlying distribution of a gene expression data set is usually unknown, and (2) so many clustering algorithms are available that choosing the proper one is very challenging. To provide a textual summary of gene clusters, the most explored approach is extractive and builds essentially on techniques borrowed from information retrieval, where the objective is to provide terms for query expansion, not to act as a stand-alone summary of an entire document set. A further drawback is that cluster quality and cluster interpretation are treated as two isolated research problems and studied separately, although they are closely related and must be addressed in a coherent and unified way. Relatively high-quality clusters are needed first in order to obtain a correct, informative biological explanation of a gene cluster; otherwise the explanation will be incorrect or misleading, no matter how good or robust the text summarization technique is. Based on this consideration, we design and develop a unified system, GE-Miner (gene expression miner), that addresses these issues in a principled and general manner by integrating cluster ensemble and text summarization, and that provides an environment for comprehensive gene expression data analysis. Experimental results demonstrate that our system can obtain high-quality clusters and provide concise and informative textual summaries of the gene clusters.
