Protein sequence analysis using relational soft clustering algorithms

To recognize functional sites within a protein sequence, the non-numerical attributes of the sequence must be encoded before a pattern recognition algorithm can be applied. The success of recognition depends on how efficiently the biological information contained in the sequence is encoded. In this regard, a bio-basis function maps a non-numerical sequence space to a numerical feature space, based on an amino acid mutation matrix, so that the biological content of a sequence can be maximally utilized for analysis. An important issue for the bio-basis function is how to select a minimum set of bio-bases with maximum information. In this paper, we present two relational soft clustering algorithms, rough c-medoids and fuzzy-possibilistic c-medoids, to select the most informative bio-bases. Fuzzy-possibilistic c-medoids uses both fuzzy and possibilistic memberships, which avoids the noise sensitivity of fuzzy c-medoids and the coincident clusters problem of possibilistic c-medoids, while the lower and upper boundaries of rough c-medoids deal with uncertainty, vagueness, and incompleteness in the class definition of biological data. The concept of ‘degree of resemblance’, based on the non-gapped pairwise homology alignment score, circumvents the initialization and local minima problems of both c-medoids algorithms and thereby enables efficient selection of a minimum set of the most informative bio-bases. The effectiveness of the algorithms, along with a comparison with other algorithms, is demonstrated on HIV (human immunodeficiency virus) protein datasets.
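The mapping from sequences to numerical features described above can be illustrated with a minimal Python sketch. The abstract does not state the functional form, so the sketch assumes the form commonly given in the bio-basis function literature, exp(gamma * (h(x, b) - h(b, b)) / h(b, b)), where h is the non-gapped pairwise homology score. The names `toy_score`, `homology_score`, `bio_basis`, and `encode`, the parameter `gamma`, and the scoring values are all hypothetical; a real application would look up scores in an amino acid mutation matrix such as a Dayhoff matrix.

```python
import math

def toy_score(a: str, b: str) -> int:
    """Hypothetical stand-in for a mutation-matrix lookup.

    Returns 2 for identical residues and 1 otherwise. These are NOT
    real Dayhoff/PAM values; they only keep the example self-contained.
    """
    return 2 if a == b else 1

def homology_score(x: str, y: str, score=toy_score) -> int:
    """Non-gapped pairwise homology score: sum of per-position scores."""
    if len(x) != len(y):
        raise ValueError("non-gapped alignment needs equal-length sequences")
    return sum(score(a, b) for a, b in zip(x, y))

def bio_basis(x: str, basis: str, gamma: float = 1.0, score=toy_score) -> float:
    """Assumed bio-basis form: exp(gamma * (h(x, b) - h(b, b)) / h(b, b))."""
    h_xb = homology_score(x, basis, score)
    h_bb = homology_score(basis, basis, score)  # self-score of the bio-basis
    return math.exp(gamma * (h_xb - h_bb) / h_bb)

def encode(x: str, bases: list[str], gamma: float = 1.0) -> list[float]:
    """Map one sequence to a numerical feature vector, one entry per bio-basis."""
    return [bio_basis(x, b, gamma) for b in bases]

if __name__ == "__main__":
    bases = ["ACDE", "ACDF"]      # hypothetical bio-bases
    print(encode("ACDE", bases))  # first entry is 1.0 (identical to the basis)
```

Under this assumed form, a sequence identical to a bio-basis maps to 1 and the value shrinks as the homology score drops, which is why choosing a small set of informative bio-bases directly determines the dimension and quality of the feature space.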
