Rough-Fuzzy C-Medoids Algorithm and Selection of Bio-Basis for Amino Acid Sequence Analysis

Most pattern recognition algorithms cannot take amino acids directly as inputs because they are nonnumerical variables, so the sequences must be encoded first. The bio-basis function does this by mapping a nonnumerical sequence space to a numerical feature space, and it is designed using an amino acid mutation matrix. An important issue for the bio-basis function is how to select the minimum set of bio-bases with maximum information. In this paper, we describe an algorithm, termed the rough-fuzzy c-medoids (RFCMdd) algorithm, to select the most informative bio-bases. It judiciously integrates the principles of rough sets, fuzzy sets, the c-medoids algorithm, and the amino acid mutation matrix. While the membership function of fuzzy sets enables efficient handling of overlapping partitions, the concept of lower and upper bounds of rough sets deals with uncertainty, vagueness, and incompleteness in class definition. The concept of a crisp lower bound and a fuzzy boundary of a class, introduced in RFCMdd, enables efficient selection of the minimum set of the most informative bio-bases. Some new indices are introduced for quantitatively evaluating the quality of the selected bio-bases. The effectiveness of the proposed algorithm, along with a comparison with other algorithms, is demonstrated on different types of protein data sets.
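
As a rough illustration of how a bio-basis function can turn a sequence into numerical features, the sketch below scores a query sequence against each bio-basis string with a pairwise similarity table and then normalizes the result. The abstract does not state the formula, so the exponential form exp(gamma * (h(x, b) - h(b, b)) / h(b, b)) used here is an assumption taken from the common formulation of the bio-basis function. The names `TOY_SCORES`, `homology`, `bio_basis`, and `encode` are hypothetical, and the score table is a made-up three-letter toy, not a real mutation matrix such as Dayhoff or BLOSUM.

```python
import math

# Toy symmetric similarity scores over a 3-letter alphabet.
# These values are invented for illustration; a real application
# would use an amino acid mutation matrix instead.
TOY_SCORES = {
    ("A", "A"): 4, ("G", "G"): 6, ("L", "L"): 4,
    ("A", "G"): 1, ("A", "L"): 0, ("G", "L"): 0,
}


def pair_score(a, b):
    """Look up the similarity of two residues, in either order."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


def homology(x, y):
    """Sum position-wise similarity scores of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(x, y))


def bio_basis(x, basis, gamma=1.0):
    """Assumed bio-basis form: 1.0 when x equals the basis, smaller otherwise."""
    h_bb = homology(basis, basis)  # positive for the toy table above
    return math.exp(gamma * (homology(x, basis) - h_bb) / h_bb)


def encode(x, bases, gamma=1.0):
    """Map a sequence to one numerical feature per bio-basis string."""
    return [bio_basis(x, b, gamma) for b in bases]


features = encode("AAA", ["AGL", "AAA"])
```

Each selected bio-basis contributes one coordinate of the feature vector, which is why choosing a small but informative set of bio-bases matters: it fixes both the dimensionality and the quality of the encoding.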
