A study of the bipartite ranking problem in machine learning
暂无分享,去创建一个
The problem of ranking, in which the goal is to learn a real-valued ranking function that induces a ranking or ordering over an instance space, has recently gained attention in machine learning. A particular setting of interest is the bipartite ranking problem, in which instances come from two categories, positive and negative; the learner is given examples of instances labeled as positive or negative, and the goal is to learn from these examples a ranking function that ranks future positive instances higher than negative ones. This thesis makes four important contributions to the understanding of the bipartite ranking problem.
First, we derive large deviation and uniform convergence bounds for the bipartite ranking error, a quantity used to measure the quality of a bipartite ranking function. These bounds serve to bound the expected error of a ranking function in terms of its empirical error on a finite sample. The uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients.
Second, we define a model of learnability for bipartite ranking functions, and derive both upper and lower bounds on the sample complexity of learning ranking functions in this model. The upper bound is expressed in terms of the bipartite rank-shatter coefficients. The lower bound is expressed in terms of another new combinatorial parameter that we term the rank dimension .
Third, we derive generalization bounds for bipartite ranking algorithms based on the notion of algorithmic stability. Unlike bounds based on uniform convergence, these bounds can be applied also to algorithms that search function classes of unbounded complexity. In particular, we are able to apply our results to obtain generalization bounds for kernel-based ranking algorithms, to which bounds based on uniform convergence are often not applicable.
Finally, we demonstrate a practical application of bipartite ranking to a problem in bioinformatics, namely the problem of identifying genes related to a given disease based on microarray data. Our studies on leukemia and colon cancer data sets show very promising results, including the identification of some exciting candidate genes as potential targets for drug development.