Dealing with Unknown Priors in Supervised Classification

In this work, we examine minimum expected error rate and minimum expected cost decision making in the presence of uncertainty about the class priors. More precisely, we train a classifier on a training set and, once its parameters are estimated, apply it to new real-world data that have to be labeled. We thus examine the situation in which the a priori probabilities of the classes (the priors) in the real-world data set are unknown and suspected to differ from those encountered in the training set, while the within-class densities remain unchanged (new sampling conditions). This problem is known as the "unbalanced data set problem" in the machine learning community. Various scenarios are considered:

(1) the priors of the training set and of the real-world data set are the same (simple Bayesian decision making);
(2) the priors of the two sets are different, but the new priors are known;
(3) the priors of the two sets are different and the new priors are unknown, but they can be estimated on the new data set;
(4) the priors of the two sets are different and the real-world data set is not accessible, so that no estimate of the priors can be computed.

All these cases are discussed from a decision-making point of view, with the aim of optimizing the classification results under the new sampling conditions. In particular, we show that when no information at all is available about the sampling conditions (the priors) under which the classification model will be applied, the optimal decision rule is based on the likelihood alone, that is, it assumes equal priors for all classes. This justifies the rule of thumb usually applied in this situation: train the classifier with equal proportions of observations from each class.

Marco Saerens (the corresponding author) and Nathalie Souchon are with the ISYS Unit (Information Systems Research Unit), IAG, Université catholique de Louvain, Place des Doyens 1, B-1348 Louvain-la-Neuve, Belgium. Email: {saerens, souchon}@isys.ucl.ac.be. Jean-Michel Renders is with the Xerox Research Centre Europe, Chemin de Maupertuis 6, 38240 Meylan (Grenoble), France. Email: jean-michel.renders@xrce.xerox.com. Christine Decaestecker is a Senior Research Assistant of the F.N.R.S. and is with the Laboratory of Toxicology, Institute of Pharmacy, Université Libre de Bruxelles, Campus Plaine CP 205/1, Boulevard du Triomphe, B-1050 Bruxelles, Belgium. Email: cdecaes@ulb.ac.be.

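To make scenarios (2) and (3) concrete, here is a minimal sketch in Python; the function names and the toy data are my own illustrations, not part of the paper. It assumes a probabilistic classifier whose outputs approximate the training-set posteriors p_train(y | x). Since the within-class densities are unchanged, Bayes' theorem gives the corrected posteriors by rescaling with the prior ratios and renormalizing, p_new(y | x) ∝ (π_new(y) / π_train(y)) · p_train(y | x). When the new priors are unknown but unlabeled new data are available, they can be estimated by the EM procedure described by Saerens, Latinne, and Decaestecker (Neural Computation, 2002).

```python
import numpy as np

def adjust_posteriors(train_posteriors, train_priors, new_priors):
    """Rescale training-set posteriors to new priors (scenario 2).

    train_posteriors: (N, K) array of p_train(y | x) for N samples, K classes.
    train_priors:     (K,) class priors in the training set.
    new_priors:       (K,) class priors under the new sampling conditions.
    """
    unnorm = train_posteriors * (new_priors / train_priors)  # pi_new / pi_train correction
    return unnorm / unnorm.sum(axis=1, keepdims=True)        # renormalize over classes

def em_estimate_priors(train_posteriors, train_priors, n_iter=100, tol=1e-6):
    """Estimate unknown new priors from unlabeled data via EM (scenario 3)."""
    new_priors = train_priors.copy()  # start from the training priors
    for _ in range(n_iter):
        # E-step: posteriors under the current prior estimate.
        posteriors = adjust_posteriors(train_posteriors, train_priors, new_priors)
        # M-step: new prior estimate = average responsibility per class.
        updated = posteriors.mean(axis=0)
        if np.max(np.abs(updated - new_priors)) < tol:
            return updated
        new_priors = updated
    return new_priors

# Toy usage (hypothetical numbers): training set was balanced, but the
# classifier's posteriors on the new data lean heavily toward class 0.
rng = np.random.default_rng(0)
train_priors = np.array([0.5, 0.5])
fake_posteriors = rng.dirichlet([2.0, 0.5], size=1000)
print(em_estimate_priors(fake_posteriors, train_priors))
```

In scenario (4), no such correction is possible, which is where the paper's result applies: with no information about the new priors, decisions should be based on the likelihood alone, i.e., equal priors for all classes.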