A 'non-parametric' version of the naive Bayes classifier

Many algorithms have been proposed for the machine learning task of classification. One of the simplest, the naive Bayes classifier, has often been found to perform well even though its underlying assumptions (independence and normality of the variables) may well be violated. In previous work, we applied naive Bayes and other standard algorithms to a breast cancer database from Nottingham City Hospital in which the variables are highly non-normal, and found that naive Bayes performed well when predicting a class that had been derived from the same data. However, when we then applied naive Bayes to predict an alternative clinical variable, it performed much worse than the other techniques. This motivated us to propose an alternative method, based on naive Bayes, which removes the requirement for the variables to be normally distributed but retains the essential structure and the other underlying assumptions of the method. We tested our novel algorithm on our breast cancer data and on three UCI datasets which also exhibited strong violations of normality. Our algorithm outperformed naive Bayes in all four cases and outperformed multinomial logistic regression (MLR) in two cases. We conclude that our method offers a competitive alternative to MLR and naive Bayes for datasets in which non-normal distributions are observed.
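The abstract does not say how the per-variable distributions are estimated once the normality requirement is dropped. As a minimal sketch of the general idea only, and not the authors' algorithm, the Python example below keeps the naive Bayes structure (class prior times a product of independent per-variable densities) but replaces each fitted normal with a Gaussian kernel density estimate. The class name `KDENaiveBayes`, the rule-of-thumb bandwidth, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


class KDENaiveBayes:
    """Illustrative naive Bayes with per-class, per-feature kernel densities.

    Hypothetical sketch: the independence assumption is kept, but each
    one-dimensional class-conditional density is a Gaussian KDE rather
    than a single fitted normal distribution.
    """

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # Class priors estimated from class frequencies.
        self.log_priors_ = np.log([(y == c).mean() for c in self.classes_])
        # Training points of each class are kept as the KDE support.
        self.samples_ = [X[y == c] for c in self.classes_]
        self.bandwidths_ = []
        for S in self.samples_:
            n = S.shape[0]
            # Normal-reference rule-of-thumb bandwidth per feature (assumed choice).
            h = 1.06 * S.std(axis=0) * n ** (-0.2)
            self.bandwidths_.append(np.maximum(h, 1e-3))
        return self

    @staticmethod
    def _log_likelihood(X, S, h):
        # z has shape (n_query, n_train, n_features).
        z = (X[:, None, :] - S[None, :, :]) / h
        log_k = -0.5 * z ** 2 - np.log(h) - 0.5 * np.log(2.0 * np.pi)
        # Stable log of the mean kernel value over training points, per feature.
        mx = log_k.max(axis=1)
        log_dens = mx + np.log(np.exp(log_k - mx[:, None, :]).sum(axis=1))
        log_dens -= np.log(S.shape[0])
        # Naive independence: sum the per-feature log densities.
        return log_dens.sum(axis=1)

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        scores = np.column_stack([
            lp + self._log_likelihood(X, S, h)
            for lp, S, h in zip(self.log_priors_, self.samples_, self.bandwidths_)
        ])
        return self.classes_[scores.argmax(axis=1)]


# Synthetic, strongly non-normal data (assumed for illustration only):
# class 0 is bimodal in feature 0, class 1 is unimodal; feature 1 is skewed.
rng = np.random.default_rng(0)
x0 = np.concatenate([rng.normal(-3, 0.5, 100), rng.normal(3, 0.5, 100)])
x1 = rng.normal(0, 0.5, 200)
X = np.column_stack([
    np.concatenate([x0, x1]),
    np.concatenate([rng.exponential(1.0, 200), rng.exponential(2.0, 200)]),
])
y = np.array([0] * 200 + [1] * 200)

model = KDENaiveBayes().fit(X, y)
train_accuracy = (model.predict(X) == y).mean()
```

Histogram or discretisation-based density estimates would fit the same structure; the kernel estimate is used here only because it needs no binning decisions.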
