Clustering Support Vector Machines and Its Application to Local Protein Tertiary Structure Prediction

Support Vector Machines (SVMs) are a new generation of machine learning techniques and have shown strong generalization capability on many data mining tasks. SVMs handle nonlinear classification by using a nonlinear kernel function to implicitly map samples from the input feature space into a high-dimensional feature space. However, SVMs are poorly suited to huge datasets with millions of samples. Granular computing decomposes information into aggregates, called granules, and solves the target problem within each granule. We therefore propose a novel computational model called Clustering Support Vector Machines (CSVMs) to deal with complex classification problems on huge datasets. Drawing on both the theory of granular computing and advanced statistical learning methodology, the model builds a separate CSVM for each information granule produced by the clustering algorithm. This makes the learning task for each CSVM more specific and simpler. Moreover, because each CSVM is built for its own granule, training can be easily parallelized, so CSVMs can handle huge datasets efficiently. The CSVM model is applied to the prediction of local protein tertiary structure. Compared with the conventional clustering method, prediction accuracy for local protein tertiary structure improves noticeably when the new CSVM model is used. These encouraging experimental results indicate that the new computational model offers a new way to solve complex classification problems on huge datasets.
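The cluster-then-classify idea described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several assumptions: plain k-means stands in for the clustering algorithm, a linear SVM trained by batch subgradient descent on the hinge loss stands in for the kernel SVMs, features are centred on the granule centroid before local training, and the toy data and all function names (`kmeans`, `train_linear_svm`, `fit_csvm`, `predict_csvm`) are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, p):
    # Index of the centroid closest to point p.
    return min(range(len(centroids)), key=lambda j: sq_dist(centroids[j], p))

def kmeans(points, k, iters=20, seed=0):
    # Plain k-means (assumption: the paper's clustering step may differ).
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def train_linear_svm(X, y, lam=0.01, lr=0.1, steps=500):
    # Batch subgradient descent on the L2-regularised hinge loss.
    # Assumption: a linear stand-in for a kernel SVM; labels are -1 or +1.
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(steps):
        gw, gb = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                gw = [g - yi * xj / n for g, xj in zip(gw, xi)]
                gb -= yi / n
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def fit_csvm(points, labels, k):
    # One local classifier per granule (cluster). Each fit is independent
    # of the others, so this loop could be run in parallel.
    centroids = kmeans(points, k)
    models = []
    for j, c in enumerate(centroids):
        members = [i for i, p in enumerate(points) if nearest(centroids, p) == j]
        X = [[a - m for a, m in zip(points[i], c)] for i in members]
        y = [labels[i] for i in members]
        # Fallback for an empty granule: an all-zero model.
        models.append(train_linear_svm(X, y) if X else ([0.0] * len(c), 0.0))
    return centroids, models

def predict_csvm(centroids, models, p):
    # Route the sample to its granule, then apply that granule's classifier.
    j = nearest(centroids, p)
    w, b = models[j]
    score = sum(wj * (a - m) for wj, a, m in zip(w, p, centroids[j])) + b
    return 1 if score >= 0 else -1

# Toy data (invented for illustration): two granules near x = 0 and x = 10
# whose class boundaries point in opposite directions.
points, labels = [], []
for i in range(10):
    d, v = 0.5 + 0.1 * i, 0.1 * i - 0.45
    for p, lab in [((-d, v), -1), ((d, v), 1), ((10 - d, v), 1), ((10 + d, v), -1)]:
        points.append(p)
        labels.append(lab)

centroids, models = fit_csvm(points, labels, k=2)
predictions = [predict_csvm(centroids, models, p) for p in points]
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(accuracy)
```

In this toy set the class labels flip direction between the two granules, so no single linear boundary separates them, while each granule on its own is linearly separable. That is the sense in which per-granule learning tasks become simpler.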
