Paper information - Protein sequence analysis using relational soft clustering algorithms

Protein sequence analysis using relational soft clustering algorithms

To recognize functional sites within a protein sequence, the non-numerical attributes of the sequence need encoding prior to using a pattern recognition algorithm. The success of recognition depends on the efficient coding of the biological information contained in the sequence. In this regard, a bio-basis function maps a non-numerical sequence space to a numerical feature space, based on an amino acid mutation matrix. In effect, the biological content in a sequence can be maximally utilized for analysis. One of the important issues for the bio-basis function is how to select a minimum set of bio-bases with maximum information. In this paper, we present two relational soft clustering algorithms, named rough c-medoids and fuzzy-possibilistic c-medoids, to select the most informative bio-bases. While both fuzzy and possibilistic memberships of fuzzy-possibilistic c-medoids avoid the noise sensitivity defect of fuzzy c-medoids and the coincident clusters problem of possibilistic c-medoids, the concept of lower and upper boundaries of rough c-medoids deals with uncertainty, vagueness, and incompleteness in class definition of biological data. The concept of ‘degree of resemblance’, based on non-gapped pairwise homology alignment score, circumvents the initialization and local minima problems of both c-medoids algorithms. In effect, it enables efficient selection of a minimum set of most informative bio-bases. The effectiveness of the algorithms, along with a comparison with other algorithms, has been demonstrated on HIV (human immunodeficiency virus) protein datasets.
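The mapping described in the abstract can be illustrated with a minimal sketch. It assumes the form of the bio-basis function commonly given in the bio-basis literature, f(x, b) = exp(γ (h(x, b) − h(b, b)) / h(b, b)), where h is the non-gapped pairwise homology alignment score, and it assumes that the degree of resemblance of x to a reference sequence r is the ratio h(x, r) / h(r, r). Neither formula is stated in the abstract itself, so both are assumptions here. The four-letter alphabet and the values in `TOY_SCORES` are hypothetical stand-ins for a real amino acid mutation matrix, and all function names are illustrative.

```python
import math

# Hypothetical similarity scores for a toy 4-letter alphabet. A real analysis
# would use an amino acid mutation matrix (e.g. Dayhoff or BLOSUM) instead.
TOY_SCORES = {
    ("A", "A"): 4, ("C", "C"): 9, ("D", "D"): 6, ("E", "E"): 5,
    ("A", "C"): 0, ("A", "D"): 1, ("A", "E"): 1,
    ("C", "D"): 0, ("C", "E"): 0, ("D", "E"): 3,
}


def pair_score(a, b):
    """Look up the symmetric similarity score of two residues."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


def homology_score(x, y):
    """Non-gapped pairwise alignment score: sum of position-wise scores."""
    if len(x) != len(y):
        raise ValueError("non-gapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(x, y))


def bio_basis(x, basis, gamma=1.0):
    """Assumed bio-basis form: exp(gamma * (h(x, b) - h(b, b)) / h(b, b))."""
    h_bb = homology_score(basis, basis)
    return math.exp(gamma * (homology_score(x, basis) - h_bb) / h_bb)


def degree_of_resemblance(x, ref):
    """Assumed definition: h(x, ref) / h(ref, ref)."""
    return homology_score(x, ref) / homology_score(ref, ref)


def encode(x, bases, gamma=1.0):
    """Map one sequence to a numerical feature vector, one value per bio-basis."""
    return [bio_basis(x, b, gamma) for b in bases]
```

Under these assumptions, a sequence identical to a bio-basis maps to 1.0 and less similar sequences map to smaller positive values, so `encode("ACED", ["ACDE", "AAAA"])` returns one feature per selected bio-basis. Choosing which sequences serve as bio-bases is the selection problem that the clustering algorithms in the abstract address; this sketch does not implement that step.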

Sankar K. Pal | Pradipta Maji
