Microarray Gene Cluster Identification and Annotation Through Cluster Ensemble and EM-Based Informative Textual Summarization

Generating high-quality gene clusters and identifying the underlying biological mechanism of the gene clusters are the important goals of clustering gene expression analysis. To get high-quality cluster results, most of the current approaches rely on choosing the best cluster algorithm, in which the design biases and assumptions meet the underlying distribution of the dataset. There are two issues for this approach: 1) usually, the underlying data distribution of the gene expression datasets is unknown and 2) there are so many clustering algorithms available and it is very challenging to choose the proper one. To provide a textual summary of the gene clusters, the most explored approach is the extractive approach that essentially builds upon techniques borrowed from the information retrieval, in which the objective is to provide terms to be used for query expansion, and not to act as a stand-alone summary for the entire document sets. Another drawback is that the clustering quality and cluster interpretation are treated as two isolated research problems and are studied separately. In this paper, we design and develop a unified system Gene Expression Miner to address these challenging issues in a principled and general manner by integrating cluster ensemble, text clustering, and multidocument summarization and provide an environment for comprehensive gene expression data analysis. We present a novel cluster ensemble approach to generate high-quality gene cluster. In our text summarization module, given a gene cluster, our expectation-maximization based algorithm can automatically identify subtopics and extract most probable terms for each topic. Then, the extracted top k topical terms from each subtopic are combined to form the biological explanation of each gene cluster. Experimental results demonstrate that our system can obtain high-quality clusters and provide informative key terms for the gene clusters.

[1]  George Karypis,et al.  Multilevel k-way Partitioning Scheme for Irregular Graphs , 1998, J. Parallel Distributed Comput..

[2]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[3]  Bart De Moor,et al.  Evaluation of the Vector Space Representation in Text-Based Gene Clustering , 2002, Pacific Symposium on Biocomputing.

[4]  Karl Jost,et al.  Zipl George K. The Psycho-Biology of Language. Boston, Houghton Mifflin Company 1935. 336 S. 4° , 1937 .

[5]  Thomas G. Dietterich Multiple Classifier Systems , 2000, Lecture Notes in Computer Science.

[6]  Werner Dubitzky,et al.  A Practical Approach to Microarray Data Analysis , 2003, Springer US.

[7]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[8]  Michael E. Cusick,et al.  The Yeast Proteome Database (YPD) and Caenorhabditis elegans Proteome Database (WormPD): comprehensive resources for the organization and comparison of model organism protein information , 2000, Nucleic Acids Res..

[9]  James C. Bezdek,et al.  Pattern Recognition with Fuzzy Objective Function Algorithms , 1981, Advanced Applications in Pattern Recognition.

[10]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[11]  K. Murali,et al.  MedMeSH Summarizer: Text Mining for Gene Clusters , 2002, SDM.

[12]  A. Valencia,et al.  Mining functional information associated with expression arrays , 2001, Functional & Integrative Genomics.

[13]  Anton J. Enright,et al.  TEXTQUEST: Document Clustering of MEDLINE Abstracts For Concept Discovery In Molecular Biology , 2000, Pacific Symposium on Biocomputing.

[14]  Guang R. Gao,et al.  An adaptive meta-clustering approach: combining the information from different clustering results , 2002, Proceedings. IEEE Computer Society Bioinformatics Conference.

[15]  Peter Schäuble,et al.  Using the Co-occurrence of Words for Retrieval Weighting , 2000, Information Retrieval.

[16]  Francisco Azuaje,et al.  Clustering Genomic Expression Data: Design and Evaluation Principles , 2003 .

[17]  Min Song,et al.  KPSpotter: a flexible information gain-based keyphrase extraction system , 2003, WIDM '03.

[18]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[19]  Hagit Shatkay,et al.  Genes, Themes, and Microarrays: Using Information Retrieval for Large-Scale Gene Analysis , 2000, ISMB.

[20]  Michael Gribskov,et al.  Use of keyword hierarchies to interpret gene expression patterns , 2001, Bioinform..

[21]  Michael Ruogu Zhang,et al.  Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. , 1998, Molecular biology of the cell.

[22]  George Karypis,et al.  Clustering in life sciences. , 2003, Methods in molecular biology.

[23]  David Botstein,et al.  SGD: Saccharomyces Genome Database , 1998, Nucleic Acids Res..

[24]  C. Ball,et al.  Saccharomyces Genome Database. , 2002, Methods in enzymology.

[25]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[26]  Xiaohua Hu,et al.  Using rough sets theory and database operations to construct a good ensemble of classifiers for data mining applications , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[27]  Zohar Yakhini,et al.  Clustering gene expression patterns , 1999, J. Comput. Biol..

[28]  L Hunter,et al.  MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. , 1999, BioTechniques.