Block-Based Approaches to Learning Ranking Functions with Application to Protein Homology Prediction

In many information retrieval systems, such as Web search engines and biological-sequence search engines, the ranking function that orders search results by their relevance to the query is one of the most important components. In machine learning approaches to constructing ranking functions, the feature vectors of database items are computed with respect to a query, so the data are naturally grouped into blocks, one per query. However, few existing algorithms take this block structure into account when learning a ranking function. This paper describes a series of approaches that learn ranking functions more accurately by exploiting the block structure of the data, and applies them to protein homology prediction, a key step of protein structure prediction in bioinformatics. The approaches range from data normalization and data reduction to learner training. The data reduction methods, which include block selection and support vector under-sampling, contributed to our winning entry in the protein homology prediction task of the ACM KDDCUP-2004 competition. By extending block selection to a query-adaptive version and combining it with ensemble learning, we propose a novel ranking-function learning algorithm named K-Nearest-Block (KNB) Ensemble Ranking. In this algorithm, given the data block derived from a new query, only the most similar data blocks in the training data are used to learn a ranking function. Experiments with the support vector machine (SVM) as the benchmark learner demonstrate that all of the proposed block-based approaches significantly improve the ranking performance of SVMs. In particular, the KNB SVM ensemble achieves the highest overall accuracy reported so far on the blinded test data set of the KDDCUP-2004 protein homology prediction problem.
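
The query-adaptive block selection step can be illustrated with a minimal sketch. The abstract does not specify how block similarity is measured or how blocks are normalized, so the choices here are assumptions made only for illustration: similarity is the Euclidean distance between block centroids computed on raw features, and normalization is a per-block z-score. The function names `normalize_block`, `block_centroid`, and `k_nearest_blocks` are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def normalize_block(block):
    """Z-score each feature within a single query's block.

    Assumed normalization scheme; the paper's actual method may differ.
    """
    columns = list(zip(*block))
    mus = [mean(col) for col in columns]
    # A constant feature has zero deviation; fall back to 1.0 to avoid dividing by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in block]


def block_centroid(block):
    """Mean feature vector of a block (a list of feature vectors)."""
    return [mean(col) for col in zip(*block)]


def k_nearest_blocks(query_block, training_blocks, k):
    """Return indices of the k training blocks closest to the query block.

    Closeness is the Euclidean distance between centroids of the raw
    (un-normalized) blocks, which is an assumed similarity measure.
    """
    q = block_centroid(query_block)
    distances = [
        (math.dist(q, block_centroid(b)), i) for i, b in enumerate(training_blocks)
    ]
    return [i for _, i in sorted(distances)[:k]]


if __name__ == "__main__":
    training_blocks = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[10.0, 10.0], [11.0, 11.0]],
        [[0.5, 0.5], [1.5, 1.5]],
    ]
    query_block = [[0.2, 0.2], [1.2, 1.2]]
    print(k_nearest_blocks(query_block, training_blocks, k=2))
```

In a full pipeline along the lines the abstract describes, a ranker such as an SVM would then be trained on the selected blocks only, and several such rankers would be combined into an ensemble; that part is omitted here because the abstract gives no details of the combination rule.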
