Technical Note: Naive Bayes for Regression

Despite its simplicity, the naive Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case, predictions are more sensitive to inaccurate probability estimates. This paper shows how to apply the naive Bayes methodology to numeric prediction (i.e., regression) tasks by modeling the probability distribution of the target value with kernel density estimators, and compares it to linear regression, locally weighted linear regression, and a method that produces “model trees” (decision trees with linear regression functions at the leaves). Although we exhibit an artificial dataset for which naive Bayes is the method of choice, on real-world datasets it is almost uniformly worse than locally weighted linear regression and model trees. The comparison with linear regression depends on the error measure: for one measure naive Bayes performs similarly, while for another it is worse. We also show that standard naive Bayes, applied to regression problems by discretizing the target value, performs similarly badly. We then present empirical evidence that isolates naive Bayes' independence assumption as the culprit for its poor performance in the regression setting. These results indicate that the simplistic statistical assumption that naive Bayes makes is indeed more restrictive for regression than for classification.
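
The abstract only summarizes the methodology, so the following is a minimal sketch of how naive Bayes regression with Gaussian kernel density estimators might be implemented. It is an illustration under assumptions of my own (Gaussian kernels, a rule-of-thumb bandwidth, and a posterior-mean prediction computed over a grid of candidate target values), not the authors' implementation; all function names and parameters are hypothetical. The model scores each candidate target y as p(y) * prod_i p(x_i | y), with each conditional approximated as p(x_i, y) / p(y) from kernel density estimates.

    import numpy as np

    def gaussian_kernel(u):
        # standard normal kernel
        return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

    def kde(points, data, h):
        # 1-D kernel density estimate of `data`, evaluated at `points`
        return gaussian_kernel((points[:, None] - data[None, :]) / h).mean(axis=1) / h

    def nb_regression_predict(X_train, y_train, x_query, n_grid=200):
        # p(y | x) is proportional to p(y) * prod_i p(x_i | y); each conditional is
        # approximated as p(x_i, y) / p(y) using product Gaussian kernel estimates.
        # The point prediction is the posterior mean over a grid of target values.
        n, d = X_train.shape
        h = 1.06 * n ** (-0.2)                       # rule-of-thumb bandwidth factor (assumed choice)
        h_y = h * y_train.std()
        grid = np.linspace(y_train.min(), y_train.max(), n_grid)

        prior = kde(grid, y_train, h_y)              # estimate of p(y)
        log_post = np.log(prior + 1e-300)
        for i in range(d):
            h_x = h * X_train[:, i].std()
            ky = gaussian_kernel((grid[:, None] - y_train[None, :]) / h_y)
            kx = gaussian_kernel((x_query[i] - X_train[:, i]) / h_x)
            joint = (ky * kx[None, :]).mean(axis=1)  # unnormalised estimate of p(x_i, y)
            log_post += np.log(joint + 1e-300) - np.log(prior + 1e-300)

        post = np.exp(log_post - log_post.max())     # normalise over the grid
        post /= post.sum()
        return float((grid * post).sum())            # posterior mean as point prediction

A call such as nb_regression_predict(X, y, x_new) would then return a numeric prediction for a single query instance; the constant kernel normalisation factors cancel when the posterior is renormalised over the grid, which is why they are omitted in the joint estimate.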
