Cluster identification and separation in the growing self-organizing map: application in protein sequence classification

The growing self-organizing map (GSOM) was introduced as an improvement on the self-organizing map (SOM) algorithm for clustering and knowledge discovery. Unlike the traditional SOM, the GSOM has a dynamic structure in which nodes are added as learning progresses, reflecting the knowledge discovered from the input data. The spread factor (SF) parameter of the GSOM controls the spread of the map, giving the analyst the flexibility to examine clusters at different granularities. Although the GSOM has been applied in various areas and has proven effective in knowledge discovery tasks, no comprehensive study has examined how the spread factor value affects cluster formation and separation. The aim of this paper is therefore to investigate the effect of the spread factor value on cluster separation in the GSOM. We use a simple k-means algorithm to identify clusters in the GSOM. Using the Davies–Bouldin index, we obtain the clusters formed at different spread factor values and analyze them. We show that clusters can become more separated as the spread factor value is increased. Hierarchical clusters can then be constructed by mapping the GSOM clusters obtained at different spread factor values.
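The cluster identification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the node weight vectors are synthetic stand-ins for a trained GSOM, the function names (`growth_threshold`, `kmeans`, `davies_bouldin`) are hypothetical, and the farthest-first initialisation is an assumption made only to keep the run deterministic. The growth threshold formula GT = -D ln(SF) is the one commonly given in the GSOM literature and is not stated in this abstract. The sketch runs k-means for several values of k and keeps the k with the lowest Davies–Bouldin index.

```python
import math
import numpy as np

def growth_threshold(dim, spread_factor):
    """GT = -D * ln(SF), as commonly given in the GSOM literature.
    A higher spread factor gives a lower threshold, so the map grows more."""
    return -dim * math.log(spread_factor)

def _assign(points, centers):
    # Index of the nearest center for every point.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100):
    """Plain k-means with farthest-first initialisation (an assumption
    chosen for determinism, not part of the source)."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.linalg.norm(
            points[:, None, :] - np.array(centers)[None, :, :], axis=2
        ).min(axis=1)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = _assign(points, centers)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return _assign(points, centers), centers

def davies_bouldin(points, labels, centers):
    """Davies-Bouldin index (lower means better separated clusters).
    Assumes every cluster is non-empty and all centers are distinct."""
    k = len(centers)
    scatter = np.array([
        np.linalg.norm(points[labels == j] - centers[j], axis=1).mean()
        for j in range(k)
    ])
    sep = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(sep, np.inf)  # exclude i == j from the maximum
    ratios = (scatter[:, None] + scatter[None, :]) / sep
    return ratios.max(axis=1).mean()

# Synthetic stand-in for the weight vectors of GSOM nodes: three tight groups.
rng = np.random.default_rng(42)
node_weights = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(20, 2))
    for c in [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
])

# Score each candidate number of clusters and keep the best one.
scores = {}
for k in range(2, 7):
    labels, centers = kmeans(node_weights, k)
    scores[k] = davies_bouldin(node_weights, labels, centers)
best_k = min(scores, key=scores.get)
print("Davies-Bouldin index per k:", scores)
print("selected number of clusters:", best_k)
```

In a study like the one described, this selection would be repeated on the maps trained at each spread factor value, so that the resulting index values and cluster assignments can be compared across granularities.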
