On the Generation of Accurate Predictive Model from Highly Imbalanced Data with Heuristics and Replication Techniques

Recent advances in life science data mining have inspired researchers and healthcare professionals to apply this technology to obtain descriptive patterns and predictive models from biomedical and healthcare databases. Discovering hidden biomedical patterns in large clinical databases can uncover knowledge that supports prognostic and diagnostic decision making. However, clinical applications of data mining algorithms suffer from low predictive accuracy, which hampers their wide use in clinical environments. We therefore focus our study on improving the predictive accuracy of the models these algorithms create. Our main research interest is the problem of learning a classification model from a multiclass data set in which some minority classes have a low prevalence rate. With such data, directly applying classification techniques such as decision tree induction, regression analysis, neural networks, or support vector machines yields a model with suboptimal predictive accuracy. To remedy the imbalanced class distribution among data instances, we apply random over-sampling and the synthetic minority over-sampling technique (SMOTE) to increase the predictive performance of the learned model. In our preliminary study, we treat specific kinds of primary tumors occurring at a frequency of less than one percent as rare, minority classes. In the experiments, SMOTE gave a high-specificity model, whereas random over-sampling produced a high-sensitivity classifier. The precision of a classification model obtained with random over-sampling is on average much better than that of a model learned from the original imbalanced data set.
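To make the two over-sampling ideas concrete, the following is a minimal sketch in plain Python, not the implementation used in the study. The function names `random_oversample` and `smote_like`, the choice to balance every class up to the size of the largest class, and the neighbour count `k` are all illustrative assumptions. Random over-sampling duplicates existing minority instances, while the SMOTE-style routine creates new points by interpolating between a minority instance and one of its nearest minority neighbours.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest one (an assumed balancing target)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap to the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the line segment between
    a minority point and one of its k nearest minority neighbours.
    Requires at least two minority points with numeric features."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    rows = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.1), (5.0, 8.0), (6.0, 9.0), (9.0, 9.5)]
    labels = ["a", "a", "a", "b", "b", "c"]
    _, balanced_labels = random_oversample(rows, labels)
    print(Counter(balanced_labels))
    print(smote_like([(5.0, 8.0), (6.0, 9.0), (5.5, 8.2)], n_new=3))
```

Because random over-sampling only repeats existing instances while the interpolating variant produces new ones, the two can shape the decision boundary differently, which is one plausible reading of why the study observes different sensitivity and specificity trade-offs between them.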
We then extend our study by designing a heuristic-based method to cope with the abundance of irrelevant features, which slows learning and sometimes lowers the accuracy rate. The over-sampling technique and the heuristic-based feature selection are coupled as a data preparation method for imbalanced data sets with many irrelevant features. The experimental results on the arrhythmia and communities-and-crime data sets show significant improvement in the predictive accuracy, specificity, and sensitivity of the induced models.
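The abstract does not state how sensitivity and specificity are computed for a multiclass problem. One common convention, assumed here purely for illustration, is to treat each class in turn as the positive class and all other classes as negative (one-vs-rest). The helper name `per_class_sensitivity_specificity` is hypothetical.

```python
def per_class_sensitivity_specificity(y_true, y_pred):
    """Return {class: (sensitivity, specificity)} under a one-vs-rest view.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    A ratio with a zero denominator is reported as NaN.
    """
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    result = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        tn = sum(1 for t, p in pairs if t != c and p != c)
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        result[c] = (sensitivity, specificity)
    return result


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    y_true = ["a", "a", "a", "b", "b", "c"]
    y_pred = ["a", "a", "b", "b", "b", "a"]
    for label, (sens, spec) in per_class_sensitivity_specificity(y_true, y_pred).items():
        print(label, round(sens, 3), round(spec, 3))
```

Reporting the two measures per class keeps a rare class visible: a model that never predicts the rare class can still score high overall accuracy, but its sensitivity for that class is zero.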
