Protein Sequence Motif Information Generated by Fuzzy - Hybrid Hierarchical K-means Clustering Algorithm

Recurring amino acid sequence patterns are referred to as protein sequence motifs. These patterns are important because conserved regions have the potential to reveal the role of the protein itself. In this paper, we modify the FGK model and apply the Hybrid Hierarchical K-means (HHK) clustering algorithm, a hybrid of agglomerative hierarchical clustering and K-means clustering, in place of the greedy K-means clustering algorithm to discover protein sequence motifs that transcend protein family boundaries. This dual algorithm requires no user-defined parameters to identify the similarities and dissimilarities between the protein sequences. Our analysis of the motifs generated by the HHK algorithm shows that the results are significant not only at the sequence level but also in secondary structure. More than 49% of the clusters share more than 60% secondary structure similarity, and 14% of the clusters share more than 70% secondary structure similarity. Compared with the previous work, which yields only 25% and 0% of clusters in the 60% and 70% groups respectively, the newly proposed approach gives a better understanding of the relationships among a set of sequences. We believe that the HHK algorithm, together with the change to the FGK model, will generate better results than those previously reported.
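
The abstract describes HHK only as a combination of agglomerative hierarchical clustering and K-means, so the following is a minimal sketch of one common way to combine the two, not the authors' implementation. It assumes that a bottom-up merge supplies the starting centroids and that K-means then refines them. The function names (`agglomerate`, `kmeans`, `hybrid_cluster`), the Euclidean distance, the centroid linkage, and the explicit cluster count `k` are all illustrative assumptions. Toy 2-D points stand in for the sequence-segment feature vectors.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def agglomerate(points, k):
    """Merge the two clusters with the closest centroids until k remain.

    Assumption: centroid linkage with Euclidean distance.
    Returns the centroids of the k resulting clusters.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]


def kmeans(points, centers, max_iter=100):
    """Lloyd iterations starting from the given centers."""
    centers = [list(c) for c in centers]  # do not mutate the caller's seeds
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = centroid(members)
    return labels, centers


def hybrid_cluster(points, k):
    """Hypothetical hybrid: hierarchical seeding followed by K-means."""
    seeds = agglomerate(points, k)
    return kmeans(points, seeds)


if __name__ == "__main__":
    toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    labels, centers = hybrid_cluster(toy, 2)
    print(labels, centers)
```

Because the seeds come from a deterministic merge rather than random or greedy selection, repeated runs on the same data give the same clusters. The pairwise search in `agglomerate` is quadratic per merge, so this sketch suits small inputs only.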
