A semi-supervised fuzzy clustering algorithm applied to gene expression data

Over the last decade there has been increasing interest in semi-supervised clustering. Several studies have suggested that even a small amount of supervised information can significantly improve the results of unsupervised learning. One popular way of incorporating partial supervision is through pair-wise constraints, which indicate whether a given pair of patterns should belong to the same cluster (Must-link) or to different clusters (Dont-link). In this study we propose a novel semi-supervised fuzzy clustering algorithm (SSFCA). The supervised information is incorporated through a term that quantifies the Must-link and/or Dont-link constraints. Additionally, we present an extension of SSFCA that allows the algorithm to detect the number of clusters in the data automatically. We apply SSFCA to the problem of clustering gene expression profiles. Because SSFCA inherits the properties of fuzzy logic, genes can belong to more than one group, which reveals richer information about their multiple functional roles. Finally, we investigate the use of prior biological knowledge derived from the Gene Ontology in selecting pair-wise constraints. Simulations on artificial and real-life datasets showed that the proposed SSFCA significantly outperformed other standard and semi-supervised clustering methods.
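The two ingredients named above, fuzzy memberships and pair-wise constraints, can be illustrated with a minimal sketch. This is not the SSFCA formulation, which the abstract does not specify. It assumes the standard fuzzy c-means membership update and one common soft penalty for constraint violations; the function names `fcm_memberships` and `constraint_penalty`, the fuzzifier `m`, and the toy data are all hypothetical.

```python
from math import dist

def fcm_memberships(points, centers, m=2.0):
    """Standard fuzzy c-means membership update; each row sums to 1."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        d = [dist(x, c) for c in centers]
        if any(v == 0.0 for v in d):
            # Point coincides with a center: split membership among those centers.
            hits = [1.0 if v == 0.0 else 0.0 for v in d]
            memberships.append([h / sum(hits) for h in hits])
            continue
        memberships.append(
            [1.0 / sum((d_k / d_l) ** power for d_l in d) for d_k in d]
        )
    return memberships

def constraint_penalty(u, must_link, dont_link):
    """Assumed soft cost: grows as memberships disagree with the constraints."""
    cost = 0.0
    for i, j in must_link:
        # Mass that places i and j in different clusters.
        same = sum(a * b for a, b in zip(u[i], u[j]))
        cost += sum(u[i]) * sum(u[j]) - same
    for i, j in dont_link:
        # Mass that places i and j in the same cluster.
        cost += sum(a * b for a, b in zip(u[i], u[j]))
    return cost

# Toy data: two well-separated groups and two candidate centers.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
centers = [(0.0, 0.5), (5.0, 5.5)]
u = fcm_memberships(points, centers)

# Constraints that agree with the grouping cost less than ones that contradict it.
consistent = constraint_penalty(u, must_link=[(0, 1)], dont_link=[(0, 2)])
contradictory = constraint_penalty(u, must_link=[(0, 2)], dont_link=[(0, 1)])
print(consistent < contradictory)
```

In a semi-supervised objective, a penalty of this kind is typically added to the usual clustering cost with a weighting factor, so that memberships are pulled toward satisfying the constraints without enforcing them strictly.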
