Lazy Learning of Bayesian Rules

The naive Bayesian classifier provides a simple and effective approach to classifier learning, but its attribute independence assumption is often violated in the real world. A number of approaches have sought to alleviate this problem. Bayesian tree learning builds a decision tree and generates a local naive Bayesian classifier at each leaf; the tests leading to a leaf can weaken attribute inter-dependencies for that local classifier. However, Bayesian tree learning still suffers from the small disjunct problem of tree learning: although inferred Bayesian trees demonstrate low average prediction error rates, there is reason to expect higher error rates at leaves with few training examples. This paper proposes the application of lazy learning techniques to Bayesian tree induction and presents the resulting lazy Bayesian rule learning algorithm, called LBR. The algorithm is justified by a variant of Bayes' theorem that requires a weaker conditional attribute independence assumption than naive Bayes. For each test example, LBR builds a rule most appropriate to that example, with a local naive Bayesian classifier as its consequent. The computational requirements of LBR are shown to be reasonable across a wide cross-section of natural domains. Experiments on these domains show that, on average, the new algorithm obtains lower error rates significantly more often than the reverse when compared with a naive Bayesian classifier, C4.5, a Bayesian tree learning algorithm, a constructive Bayesian classifier that eliminates attributes and constructs new attributes as Cartesian products of existing nominal attributes, and a lazy decision tree learning algorithm. It also outperforms a selective naive Bayesian classifier, although that difference is not statistically significant.
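
To make the weaker independence assumption concrete (the notation below is my own, not necessarily the paper's): write W for the conjunction of attribute-value tests in a rule's antecedent and V for the test example's remaining attribute values A_1 = v_1, ..., A_k = v_k. The local naive Bayesian classifier in the rule's consequent is then justified by

    P(C = c | V, W)  is proportional to  P(C = c | W) * prod_i P(A_i = v_i | C = c, W),

which only requires the attributes outside the antecedent to be conditionally independent given the class and W, rather than given the class alone as naive Bayes assumes.

The Python sketch below illustrates the lazy, per-example rule growth described in the abstract. It is a simplified illustration under stated assumptions, not the published implementation: the CategoricalNB class, the loo_misclassified helper, and the plain error-count comparison used to decide whether to add a condition are my own simplifications (LBR itself applies a statistical significance test at that step), and only nominal attributes are handled.

    import math
    from collections import Counter, defaultdict

    class CategoricalNB:
        """Minimal naive Bayes for nominal attributes, with Laplace smoothing.
        Examples are tuples; the class label sits at position target_idx."""
        def __init__(self, examples, target_idx):
            self.target_idx = target_idx
            self.attrs = [i for i in range(len(examples[0])) if i != target_idx]
            self.class_counts = Counter(ex[target_idx] for ex in examples)
            self.value_counts = defaultdict(Counter)   # (attr, class) -> value counts
            self.values = defaultdict(set)             # attr -> observed values
            for ex in examples:
                c = ex[target_idx]
                for a in self.attrs:
                    self.value_counts[(a, c)][ex[a]] += 1
                    self.values[a].add(ex[a])
            self.n = len(examples)

        def predict(self, ex):
            best_class, best_score = None, float("-inf")
            for c, nc in self.class_counts.items():
                # log P(c) + sum over attributes of log P(value | c)
                score = math.log((nc + 1) / (self.n + len(self.class_counts)))
                for a in self.attrs:
                    score += math.log((self.value_counts[(a, c)][ex[a]] + 1)
                                      / (nc + len(self.values[a])))
                if score > best_score:
                    best_class, best_score = c, score
            return best_class

    def loo_misclassified(examples, target_idx):
        """Indices of examples misclassified under leave-one-out evaluation."""
        missed = set()
        for i, ex in enumerate(examples):
            rest = examples[:i] + examples[i + 1:]
            if rest and CategoricalNB(rest, target_idx).predict(ex) != ex[target_idx]:
                missed.add(i)
        return missed

    def lbr_classify(test_ex, train, target_idx):
        """Classify one test example with a lazily grown Bayesian rule.

        A condition (attribute = the test example's value) is added to the
        antecedent while doing so reduces the estimated leave-one-out errors
        on the covered training examples; the published algorithm uses a
        significance test here rather than this plain comparison."""
        attrs = [i for i in range(len(test_ex)) if i != target_idx]
        used = set()              # attributes already in the antecedent
        covered = list(train)     # training examples satisfying the antecedent
        missed = loo_misclassified(covered, target_idx)
        improved = True
        while improved:
            improved = False
            best = None           # (error reduction, attribute, subset, subset's missed set)
            for a in attrs:
                if a in used:
                    continue
                match_idx = [i for i, ex in enumerate(covered) if ex[a] == test_ex[a]]
                if not match_idx or len(match_idx) == len(covered):
                    continue      # condition covers nothing, or excludes nothing
                subset = [covered[i] for i in match_idx]
                subset_missed = loo_misclassified(subset, target_idx)
                # errors if adopted: new local classifier on the matching examples,
                # plus the current classifier's existing mistakes on the rest
                kept = len(missed - set(match_idx))
                reduction = len(missed) - (len(subset_missed) + kept)
                if best is None or reduction > best[0]:
                    best = (reduction, a, subset, subset_missed)
            if best and best[0] > 0:
                _, a, covered, missed = best
                used.add(a)
                improved = True
        return CategoricalNB(covered, target_idx).predict(test_ex)

A caller invokes lbr_classify once per test example, which is what makes the approach lazy: no global model is built before classification time, and each test example gets its own antecedent and local naive Bayesian classifier.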
