NetCube: Fast, Approximate Database Queries Using Bayesian Networks

We present a novel method for answering count queries on a large database approximately and quickly. Our method implements an approximate DataCube of the application domain, which can be used to answer any conjunctive count query the user can form. The DataCube is a conceptual device that, in principle, stores the number of matching records for every possible such query. Because its size and generation time are inherently exponential, our approach instead uses one or more Bayesian networks to implement it approximately. Bayesian networks are statistical graphical models that can succinctly represent the underlying joint probability distribution of the domain, and can therefore be used to calculate approximate counts for any conjunctive query, that is, any combination of attribute values and “don’t cares.” The structure and parameters of these networks are learned from the database in a preprocessing stage. By means of such a network, the proposed method, called NetCube, exploits correlations and independencies among attributes to answer a count query quickly without accessing the database. Our preprocessing algorithm scales linearly with the size of the database, and it is straightforward to parallelize. We give an algorithm for estimating the count of arbitrary queries whose running time is constant in (independent of) the database size. Our experimental results show that NetCubes
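The idea of estimating a count as (number of records) × (probability of the query under the network) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the paper: the toy table, the attribute names, the fixed chain structure `region -> product -> returned` (the paper learns the structure from data), and the brute-force summation over "don't care" attributes, which stands in for a proper Bayesian network inference algorithm.

```python
from collections import Counter
from itertools import product

# Hypothetical toy table; each record is (region, product, returned).
ATTRS = ["region", "product", "returned"]
RECORDS = [
    ("east", "book", "no"),
    ("east", "book", "no"),
    ("east", "toy", "yes"),
    ("west", "toy", "no"),
    ("west", "toy", "yes"),
    ("west", "book", "no"),
]

# Assumed network structure (fixed by hand here, not learned).
PARENTS = {"region": [], "product": ["region"], "returned": ["product"]}


def learn_cpts(records):
    """Single pass over the data: maximum-likelihood conditional tables."""
    joint = {a: Counter() for a in ATTRS}   # counts of (value, parent values)
    parent_totals = {a: Counter() for a in ATTRS}  # counts of parent values
    for rec in records:
        row = dict(zip(ATTRS, rec))
        for attr in ATTRS:
            pa = tuple(row[p] for p in PARENTS[attr])
            joint[attr][(row[attr], pa)] += 1
            parent_totals[attr][pa] += 1
    return {
        attr: {key: n / parent_totals[attr][key[1]] for key, n in joint[attr].items()}
        for attr in ATTRS
    }


def estimate_count(cpts, domains, n_records, query):
    """Estimate COUNT(*) for a conjunctive query given as {attribute: value}.

    Attributes missing from `query` are "don't cares" and are summed out
    by enumeration (exponential in their number; acceptable only for a toy).
    The original records are never touched here.
    """
    free = [a for a in ATTRS if a not in query]
    total_prob = 0.0
    for values in product(*(domains[a] for a in free)):
        row = {**query, **dict(zip(free, values))}
        row_prob = 1.0
        for attr in ATTRS:
            pa = tuple(row[p] for p in PARENTS[attr])
            row_prob *= cpts[attr].get((row[attr], pa), 0.0)
        total_prob += row_prob
    return n_records * total_prob


# Preprocessing stage: the only step that reads the records.
CPTS = learn_cpts(RECORDS)
DOMAINS = {a: sorted({rec[i] for rec in RECORDS}) for i, a in enumerate(ATTRS)}
N = len(RECORDS)

if __name__ == "__main__":
    print(estimate_count(CPTS, DOMAINS, N, {"product": "toy", "returned": "yes"}))
    print(estimate_count(CPTS, DOMAINS, N, {"region": "east", "returned": "yes"}))
```

In this toy, queries over attributes linked directly in the assumed chain (such as `product` and `returned`) reproduce the true counts, while a query over `region` and `returned` gives only an approximation (about 0.67 against a true count of 1), because the assumed structure treats those two attributes as independent given `product`.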
