In this paper, we investigate how the complexity of a biomedical dataset affects the classification accuracy of an algorithm. We quantify the complexity of a biomedical dataset using five complexity measures: correlation-based feature selection subset merit, noise, imbalance ratio, missing values and information gain. The effect of these complexity measures on classification accuracy is evaluated using five diverse machine learning algorithms: J48 (decision tree), SMO (support vector machine), Naive Bayes (probabilistic), IBk (instance-based learner) and JRIP (rule-based induction). The results of our experiments show that noise and correlation-based feature selection subset merit, rather than the particular choice of algorithm, play a major role in determining the classification accuracy. Finally, we provide researchers with a meta-model and an empirical equation to estimate the classification potential of a dataset on the basis of its complexity. This will help researchers to efficiently pre-process a dataset for automatic knowledge extraction.
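To make three of the five measures concrete, the following is a minimal sketch of how imbalance ratio, missing values and information gain could be computed on a toy dataset. The definitions used here are common textbook ones and are assumptions, not necessarily the exact formulations used in the paper: imbalance ratio is taken as the majority-to-minority class count ratio, missing values as the fraction of empty cells, and information gain as the entropy reduction for a single nominal attribute. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def imbalance_ratio(labels):
    """Assumed definition: majority class count divided by minority class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def missing_fraction(rows):
    """Assumed definition: fraction of attribute cells that are missing (None)."""
    cells = [value for row in rows for value in row]
    return sum(value is None for value in cells) / len(cells)


def information_gain(feature, labels):
    """Entropy reduction from splitting on one nominal attribute.

    Instances whose attribute value is missing are skipped (a simplification).
    """
    pairs = [(f, y) for f, y in zip(feature, labels) if f is not None]
    observed = [y for _, y in pairs]
    groups = {}
    for f, y in pairs:
        groups.setdefault(f, []).append(y)
    remainder = sum(len(g) / len(pairs) * entropy(g) for g in groups.values())
    return entropy(observed) - remainder


# Hypothetical toy dataset: two nominal attributes, one missing cell.
rows = [
    ["high", "yes"],
    ["high", "no"],
    ["low", "yes"],
    ["low", None],
    ["high", "yes"],
    ["low", "no"],
]
labels = ["sick", "sick", "healthy", "healthy", "sick", "sick"]

ir = imbalance_ratio(labels)
mv = missing_fraction(rows)
ig = information_gain([row[0] for row in rows], labels)
print(f"imbalance ratio={ir:.2f}, missing={mv:.3f}, info gain={ig:.3f}")
```

Noise and correlation-based feature selection subset merit are omitted from this sketch because they require a noise-filtering procedure and a feature-subset search, respectively, which the text above does not specify.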