Labeled data for classification are often obtained by sampling that restricts or favors certain classes. A classifier trained on such data will be biased, resulting in incorrect inference and sub-optimal classification on new data. Given an unlabeled new data set, we propose a bootstrap method to estimate its class probabilities, using an estimate of the classifier's accuracy on the training data and an estimate of the probabilities of the classifier's predictions on the new data. We then propose two methods to improve classification accuracy on new data. The first method applies only to classifiers designed to predict posterior class probabilities: the predictions of an existing classifier are adjusted according to the estimated class probabilities of the new data. The second method applies to an arbitrary classification algorithm, but it requires retraining on properly resampled data. The proposed bootstrap algorithm was validated through experiments with 500 replicates computed on 1,000 realizations for each of 16 choices of data set size, number of classes, prior class probabilities, and conditional probabilities describing a classifier's performance. Applying the proposed methodology to a benchmark data set with varying class probabilities on the unlabeled data and balanced class probabilities on the training data provided strong evidence that it can significantly improve classification on unlabeled data.
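The two ingredients described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's exact procedure: it assumes the new-data priors are recovered by solving the linear system relating prediction frequencies to true class priors through the classifier's conditional accuracies, and that posteriors are then reweighted by the ratio of estimated new priors to training priors and renormalized. All function names and numbers below are illustrative assumptions.

```python
import numpy as np

def estimate_new_priors(confusion_cond, pred_freq):
    """Estimate class priors on unlabeled data.

    confusion_cond[i, j] = P(predicted class i | true class j), estimated
    on training data; pred_freq[i] = observed frequency of prediction i
    on the new data. Solves pred_freq = confusion_cond @ priors, then
    clips and renormalizes to keep a valid probability distribution.
    """
    priors = np.linalg.solve(confusion_cond, pred_freq)
    priors = np.clip(priors, 0.0, None)
    return priors / priors.sum()

def adjust_posteriors(posteriors, train_priors, new_priors):
    """Reweight predicted posteriors by the ratio of estimated new-data
    priors to training priors, then renormalize each row to sum to 1."""
    w = np.asarray(new_priors) / np.asarray(train_priors)
    adjusted = posteriors * w  # broadcast the weight over the class axis
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Illustrative two-class case: a classifier trained on balanced classes
# (priors 0.5/0.5) outputs posteriors [0.6, 0.4] for one sample, while
# the unlabeled data are estimated to be 90%/10%.
p = adjust_posteriors(np.array([[0.6, 0.4]]), [0.5, 0.5], [0.9, 0.1])
```

Here the adjusted posterior shifts strongly toward the majority class of the new data, since the 0.9/0.5 reweighting dominates the original 0.6 score.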