
A Heterogeneous Naive-Bayesian Classifier for Relational Databases

Geetha Manjunath, M Narasimha Murty, Dinkar Sitaram

HP Laboratories, HPL-2009-225

Keywords: Relational databases, Classification, Data Mining, RDF

Most enterprise data is distributed across multiple relational databases with expert-designed schemas. Applying single-table data mining techniques to distributed relational data not only incurs a computational penalty for converting the data to a "flat" form (mega-join), but also loses the human-specified semantic information present in the relations and schema. Purely relational classification algorithms, on the other hand, do consider detailed relationships between attributes. However, these techniques require either computationally intensive transformations or multiple analyses of fused datasets, which becomes infeasible in practical scenarios. Since classification is one of the most popular predictive data mining tasks, we need practical algorithms that can be applied directly to existing databases. We present such a practical two-phase classification algorithm for relational databases with a semantic divide-and-conquer approach. We propose and prove a recursive prediction aggregation technique over heterogeneous classifiers applied to individual tables. Our approach also attempts to leverage the semantic knowledge of the application that is hidden in the database schema, using the Join Graph of the application. To automate the classification process, RDF (the core Semantic Web data model) is used for problem specification. A preliminary evaluation over TPCH and UCI benchmarks shows reduced training time in automated practical scenarios, without any loss of prediction accuracy. In fact, a comparison with other state-of-the-art techniques shows improved accuracy, due to the application of heterogeneous classifiers to individual tables.
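The abstract does not spell out the aggregation rule, so the following is only a minimal sketch of one standard way to combine per-table predictions. It assumes that the attributes of different tables are conditionally independent given the class, in which case P(c | x_1..x_k) is proportional to P(c)^(1-k) times the product of the per-table posteriors P(c | x_i). The function name `combine_posteriors` and the dictionary-based interface are hypothetical and are not taken from the report.

```python
import math


def combine_posteriors(prior, table_posteriors, eps=1e-12):
    """Combine per-table class posteriors under a naive-Bayes assumption.

    prior: dict mapping class label -> P(c)
    table_posteriors: list of dicts, one per table, mapping
        class label -> P(c | x_i), each produced by any classifier
    eps: hypothetical floor that keeps log() defined for zero probabilities

    Returns a dict mapping class label -> normalized combined posterior.
    """
    k = len(table_posteriors)
    log_scores = {}
    for label, p_c in prior.items():
        # P(c)^(1-k) corrects for the prior being counted once per table.
        score = (1 - k) * math.log(max(p_c, eps))
        for posterior in table_posteriors:
            score += math.log(max(posterior[label], eps))
        log_scores[label] = score

    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


if __name__ == "__main__":
    prior = {"yes": 0.5, "no": 0.5}
    # Posteriors from two different (heterogeneous) per-table classifiers.
    per_table = [{"yes": 0.8, "no": 0.2}, {"yes": 0.6, "no": 0.4}]
    print(combine_posteriors(prior, per_table))
```

Because each per-table posterior can come from a different kind of classifier, a rule of this form lets heterogeneous models be trained on individual tables without first materializing a join. A recursive variant over a join graph would apply the same combination at each node, which is again an assumption of this sketch and not a description of the report's algorithm.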
External Posting Date: September 6, 2009 [Fulltext]. Approved for External Publication.
Internal Posting Date: September 6, 2009 [Fulltext].
Copyright 2009 Hewlett-Packard Development Company, L.P.
