An efficient multi-relational Naïve Bayesian classifier based on semantic relationship graph

Classification is one of the most popular data mining tasks, with a wide range of applications, and many algorithms have been proposed to build accurate and scalable classifiers. Most of these algorithms take only a single table as input, whereas most real-world data are stored in multiple tables and managed by relational database systems. Because transferring data from multiple tables into a single one usually causes many problems, the development of multi-relational classification algorithms has become important and has attracted the interest of many researchers. Existing work on extending Naïve Bayes to multi-relational data either has to transform the data stored in tables into main-memory Prolog facts or limits the search space to only a small subset of real-world applications. In this work, we aim to solve these problems by building an efficient, accurate Naïve Bayesian classifier that handles data in multiple tables directly. We propose an algorithm named Graph-NB, which upgrades the Naïve Bayesian classifier for this purpose. To take advantage of the linkage relationships among tables, and to treat the tables linked to the target table differently, a semantic relationship graph is developed to describe these relationships and to avoid unnecessary joins. Furthermore, to improve accuracy, a pruning strategy is given that simplifies the graph so that too many weakly linked tables are not examined. An experimental study on both real-world and synthetic databases shows that Graph-NB achieves high efficiency and good accuracy.
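The general idea of scoring a target tuple across linked tables without materializing a join can be illustrated with a small sketch. This is not the Graph-NB algorithm itself: the toy schema (`customers`, `orders`), the function names, the label propagation through a single foreign key, and the Laplace smoothing are all assumptions made for illustration, and the semantic relationship graph and its pruning step are not modeled.

```python
import math
from collections import Counter, defaultdict

# Hypothetical target table: (id, age group, class label).
customers = [
    (1, "young", "yes"),
    (2, "young", "yes"),
    (3, "old", "no"),
    (4, "old", "no"),
]
# Hypothetical linked table: (customer_id, product); customer_id is a foreign key.
orders = [
    (1, "book"), (1, "game"),
    (2, "game"),
    (3, "tool"),
    (4, "tool"), (4, "book"),
]

def train(customers, orders):
    """Collect per-class counts in each table separately (no joined table is built)."""
    label_of = {cid: label for cid, _, label in customers}
    class_counts = Counter(label_of.values())
    age_counts = defaultdict(Counter)
    for _, age, label in customers:
        age_counts[label][age] += 1
    # Linked tuples inherit the class label of the target tuple they reference.
    product_counts = defaultdict(Counter)
    for cid, product in orders:
        product_counts[label_of[cid]][product] += 1
    return class_counts, age_counts, product_counts

def log_likelihood(counts, value, n_values):
    """Laplace-smoothed log P(value | class) from a Counter of observed values."""
    return math.log((counts[value] + 1) / (sum(counts.values()) + n_values))

def predict(model, age, products, n_ages=2, n_products=3):
    """Return the class with the highest naive Bayes log-score over both tables."""
    class_counts, age_counts, product_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)
        score += log_likelihood(age_counts[label], age, n_ages)
        for product in products:
            score += log_likelihood(product_counts[label], product, n_products)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(customers, orders)
prediction = predict(model, "young", ["game"])
```

Each table contributes its own conditional probabilities under the independence assumption, so the score is a sum of per-table log-likelihood terms rather than a statistic computed on a flattened table.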
