Learning Classifiers from Imbalanced, Only Positive and Unlabeled Data Sets

In this report, I present my results for the tasks of the 2008 UC San Diego Data Mining Contest. The contest consists of two classification tasks based on data from a scientific experiment. The first task is a binary classification task whose goal is to maximize classification accuracy on an evenly distributed test set, given a fully labeled but imbalanced training set. The second task is also a binary classification task, but its goal is to maximize the F1-score on a test set, given a partially labeled training set.

For task 1, I investigated several re-sampling techniques for improving learning from the imbalanced data: SMOTE (Synthetic Minority Over-sampling Technique), oversampling by duplicating minority examples, and random undersampling. These techniques were used to create new, balanced training sets, on which three standard classifiers (Decision Tree, Naive Bayes, Neural Network) were trained and then used to classify the test set. The results showed that the re-sampling techniques significantly improved accuracy on the test set for all classifiers except Naive Bayes.

For task 2, I implemented a two-step strategy to learn a classifier from only positive and unlabeled data. In step 1, I used the Spy technique to extract reliable negative (RN) examples from the unlabeled set. In step 2, I trained a standard Naive Bayes classifier on the labeled positive examples together with the reliable negative examples. The results showed that the two-step algorithm significantly improves the F1-score compared with simply treating all unlabeled examples as negative.
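
To make the task 1 pipeline concrete, the following is a minimal sketch of the rebalance-then-train loop using scikit-learn and the imbalanced-learn library. The library choice, the synthetic stand-in data, and the 95/5 class ratio are assumptions for demonstration only, not the contest data or the original code.

    # Illustrative rebalance-then-train loop for task 1. The synthetic data,
    # the 95/5 class ratio, and the use of imbalanced-learn are assumptions
    # for demonstration; they are not the contest data or the report's code.
    from imblearn.over_sampling import SMOTE, RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Stand-in for the contest data: a 95/5 imbalanced binary problem.
    X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                        random_state=0)

    samplers = {
        "SMOTE": SMOTE(random_state=0),                     # synthesize minority examples
        "duplicate": RandomOverSampler(random_state=0),     # duplicate minority examples
        "undersample": RandomUnderSampler(random_state=0),  # drop majority examples
    }

    for name, sampler in samplers.items():
        # Rebalance the training set, then fit a standard classifier on it.
        X_bal, y_bal = sampler.fit_resample(X_train, y_train)
        clf = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)
        print(name, accuracy_score(y_test, clf.predict(X_test)))

The same loop covers the Naive Bayes and Neural Network classifiers by swapping in a different estimator for the decision tree.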
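
The two-step strategy for task 2 can be sketched as follows, assuming numeric features, a Gaussian Naive Bayes model, a 15% spy rate, and a 5th-percentile spy threshold; these values, the function name two_step_classifier, and the variable names are illustrative choices rather than the report's exact implementation.

    # Minimal sketch of the two-step strategy with the Spy technique,
    # under the assumptions stated above.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def two_step_classifier(X_pos, X_unl, spy_frac=0.15, seed=0):
        rng = np.random.default_rng(seed)

        # Step 1: plant "spies" -- known positives hidden among the unlabeled,
        # then train Naive Bayes treating the unlabeled pool as negative.
        spy_idx = rng.choice(len(X_pos), size=int(spy_frac * len(X_pos)),
                             replace=False)
        spies, rest = X_pos[spy_idx], np.delete(X_pos, spy_idx, axis=0)
        X1 = np.vstack([rest, X_unl, spies])
        y1 = np.r_[np.ones(len(rest)), np.zeros(len(X_unl) + len(spies))]
        nb1 = GaussianNB().fit(X1, y1)

        # Because spies are truly positive, a threshold that lets ~95% of them
        # through separates likely positives from reliable negatives (RN).
        t = np.quantile(nb1.predict_proba(spies)[:, 1], 0.05)
        rn = X_unl[nb1.predict_proba(X_unl)[:, 1] < t]

        # Step 2: retrain Naive Bayes on positives vs. reliable negatives only.
        X2 = np.vstack([X_pos, rn])
        y2 = np.r_[np.ones(len(X_pos)), np.zeros(len(rn))]
        return GaussianNB().fit(X2, y2)

The baseline the report compares against corresponds to skipping step 1 entirely and labeling every unlabeled example as negative.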