Cross-validation (CV), which is widely used in classification problems, gives a good estimate of a classifier's prediction accuracy on unseen data. Any improvement in the accuracy estimate produced by cross-validation therefore benefits the many studies that rely on it. This paper focuses on skewed, noisy datasets; fraud detection is an important example of an application with skewed data. Typically, CV uses simple random sampling (SRS) to divide the data into the required number of folds, e.g., 10-fold CV divides the data into 10 folds. SRS is known to yield poor classification accuracy when the data are skewed. We propose a new algorithm, based on the frequency histogram of each attribute value, to divide the dataset into the required number of folds. The effectiveness of the proposed algorithm relative to SRS is evaluated on datasets from the UCI machine learning repository. The results show that the proposed algorithm handles noisy, skewed data significantly better.
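The sketch below is a minimal illustration of the contrast described above, not the paper's exact algorithm: it compares SRS fold assignment with a hypothetical greedy, histogram-guided assignment that deals rows sharing the same attribute values evenly across folds, so each fold's attribute-value frequencies roughly match the full dataset. All function and variable names are assumptions introduced for illustration.

```python
"""Illustrative sketch only: SRS fold assignment vs. a hypothetical
histogram-guided assignment. The paper's actual algorithm may differ."""

import random
from collections import defaultdict


def srs_folds(n_rows, k, seed=0):
    """Simple random sampling: shuffle row indices and deal them into k folds."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def histogram_guided_folds(rows, k):
    """Greedy sketch: group rows by their attribute-value signature, then deal
    each group round-robin so every fold receives a near-equal share of each
    attribute value's frequency."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row)].append(i)  # rows with identical attribute values
    folds = [[] for _ in range(k)]
    cursor = 0
    for _, members in sorted(groups.items()):
        for i in members:
            folds[cursor % k].append(i)
            cursor += 1
    return folds


if __name__ == "__main__":
    # Tiny skewed toy dataset: one rare attribute value among many common ones.
    data = [("fraud",)] * 3 + [("normal",)] * 27
    print("SRS folds:      ", srs_folds(len(data), 10))
    print("Histogram folds:", histogram_guided_folds(data, 10))
```

With SRS, the three rare rows can land in the same fold by chance, leaving other folds with no rare examples at all; the histogram-guided dealing spreads them across folds, which is the behavior the proposed method aims for on skewed data.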