A block-based support vector machine approach to the protein homology prediction task in KDD Cup 2004

This paper describes our solution for the protein homology prediction task in the KDD Cup 2004 competition. The task is modeled as a supervised learning problem with multiple performance metrics. Several key characteristics make the problem both novel and challenging, including the concept of data blocks and the presence of large-scale, imbalanced training data. These characteristics make a naive application of traditional classification algorithms infeasible. Our approach focuses on making full use of the abundant information within the blocks, and on developing a new technique for reducing and balancing the training data so that the support vector machine becomes applicable to this kind of large-scale, imbalanced learning task.
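To make the two ideas above concrete, the following minimal sketch illustrates one generic way to exploit block structure and to reduce and balance training data before fitting a classifier. It is not the technique developed in the paper, which the abstract does not specify: per-block standardization and random undersampling are swapped in here as common stand-ins, and the row layout `(block_id, features, label)`, the function names, and the `neg_per_pos` parameter are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_blocks(rows):
    """Z-score each feature using the mean and std of its own block.

    rows: list of (block_id, features, label) tuples (hypothetical layout).
    """
    by_block = defaultdict(list)
    for row in rows:
        by_block[row[0]].append(row)

    out = []
    for block_id, members in by_block.items():
        # Transpose the block's feature vectors into per-feature columns.
        columns = list(zip(*(feats for _, feats, _ in members)))
        mus = [mean(col) for col in columns]
        # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
        sigmas = [pstdev(col) or 1.0 for col in columns]
        for _, feats, label in members:
            scaled = [(x - m) / s for x, m, s in zip(feats, mus, sigmas)]
            out.append((block_id, scaled, label))
    return out


def undersample_negatives(rows, neg_per_pos=3, seed=0):
    """Keep all positives and a random subset of negatives.

    Plain random undersampling, used here only as a stand-in for a
    data reduction and balancing step.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[2] == 1]
    negatives = [r for r in rows if r[2] != 1]
    keep = min(len(negatives), neg_per_pos * len(positives))
    return positives + rng.sample(negatives, keep)
```

The reduced, block-normalized rows could then be passed to any support vector machine implementation. Standardizing within a block rather than globally is one simple way to express each example relative to the other members of its block, which is the kind of within-block information the abstract refers to.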
