Medical Datamining with a New Algorithm for Feature Selection and Naive Bayesian Classifier

Much research in data mining has gone into improving the predictive accuracy of statistical classifiers through discretization and feature selection. As a probability-based statistical classification method, the Naive Bayesian classifier has gained wide popularity despite its assumption that attributes are conditionally independent given the class label. In this paper we propose a new feature selection algorithm, CHI-WSS, to improve the classification accuracy of Naive Bayes on medical datasets. The algorithm applies discretization and simplifies 'wrapper'-based feature selection by first reducing feature dimensionality, eliminating irrelevant and least relevant features using the chi-square statistic. Our experimental results on 17 medical datasets suggest that, on average, CHI-WSS gives the best results. To compare classifier performance we use two established measures, classification accuracy (or error rate) and the area under the ROC curve, and demonstrate that the proposed algorithm with the generative Naive Bayesian classifier is on average more efficient than discriminative models, namely logistic regression and support vector machines.
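The abstract does not specify CHI-WSS beyond its two building blocks, so the following is an illustrative sketch only, not the authors' implementation: a chi-square score for ranking discretized features against the class label, and a Laplace-smoothed Naive Bayes classifier over the retained features. All names, the filter threshold, and the toy data are hypothetical.

```python
from collections import Counter
import math

def chi_square(feature, labels):
    """Chi-square statistic between one discrete feature and the class labels."""
    n = len(labels)
    obs = Counter(zip(feature, labels))
    f_counts = Counter(feature)
    c_counts = Counter(labels)
    stat = 0.0
    for f, nf in f_counts.items():
        for c, nc in c_counts.items():
            expected = nf * nc / n  # cell count expected under independence
            stat += (obs.get((f, c), 0) - expected) ** 2 / expected
    return stat

class DiscreteNB:
    """Naive Bayes for discretized features, with Laplace smoothing."""
    def fit(self, X, y):
        n = len(y)
        self.classes = sorted(set(y))
        self.priors = {c: sum(1 for t in y if t == c) / n for c in self.classes}
        values = [sorted({row[j] for row in X}) for j in range(len(X[0]))]
        self.cond = {}
        for c in self.classes:
            rows = [row for row, t in zip(X, y) if t == c]
            for j, vals in enumerate(values):
                counts = Counter(row[j] for row in rows)
                for v in vals:  # P(x_j = v | c), Laplace-smoothed
                    self.cond[(j, v, c)] = (counts.get(v, 0) + 1) / (len(rows) + len(vals))
        return self

    def predict(self, x):
        def log_posterior(c):
            s = math.log(self.priors[c])
            for j, v in enumerate(x):
                s += math.log(self.cond.get((j, v, c), 1e-9))  # unseen value: tiny prob
            return s
        return max(self.classes, key=log_posterior)

# Toy discretized data: feature 0 matches the class, feature 1 is noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [chi_square([row[j] for row in X], y) for j in range(2)]  # [8.0, 0.0]
keep = [j for j, s in enumerate(scores) if s > 0.0]  # crude filter; real threshold is a design choice
model = DiscreteNB().fit([[row[j] for j in keep] for row in X], y)
```

In the actual CHI-WSS algorithm the chi-square filter only prunes the candidate set; the surviving features would then be searched with a wrapper (repeated cross-validated evaluation of feature subsets with Naive Bayes), which the fixed threshold above merely stands in for.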
