论文信息 - The Role of Biomedical Dataset in Classification

The Role of Biomedical Dataset in Classification

In this paper, we investigate the role of a biomedical dataset on the classification accuracy of an algorithm. We quantify the complexity of a biomedical dataset using five complexity measures: correlation-based feature selection subset merit, noise, imbalance ratio, missing values and information gain. The effect of these complexity measures on classification accuracy is evaluated using five diverse machine learning algorithms: J48 (decision tree), SMO (support vector machines), Naive Bayes (probabilistic), IBk (instance based learner) and JRIP (rule-based induction). The results of our experiments show that noise and correlation-based feature selection subset merit --- not a particular choice of algorithm --- play a major role in determining the classification accuracy. In the end, we provide researchers with a meta-model and an empirical equation to estimate the classification potential of a dataset on the basis of its complexity. This well help researchers to efficiently pre-process the dataset for automatic knowledge extraction.

Muddassar Farooq | Ajay Kumar Tanwani

[1] Ian Witten,et al. Data Mining , 2000 .

[2] Catherine Blake,et al. UCI Repository of machine learning databases , 1998 .

[3] Ian H. Witten,et al. Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[4] Carla E. Brodley,et al. Identifying Mislabeled Training Data , 1999, J. Artif. Intell. Res..

[5] Tin Kam Ho,et al. Complexity Measures of Supervised Classification Problems , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[6] Muhammad Zubair Shafiq,et al. Guidelines to Select Machine Learning Scheme for Classification of Biomedical Datasets , 2009, EvoBIO.