MauveDB: supporting model-based user views in database systems

Real-world data, especially when generated by distributed measurement infrastructures such as sensor networks, tends to be incomplete, imprecise, and erroneous, making it impossible to present it to users or feed it directly into applications. The traditional approach to dealing with this problem is to first process the data using statistical or probabilistic models that can provide more robust interpretations of the data. Current database systems, however, do not provide adequate support for applying models to such data, especially when those models need to be frequently updated as new data arrives in the system. Hence, most scientists and engineers who depend on models for managing their data do not use database systems for archival or querying at all; at best, databases serve as a persistent raw data store.

In this paper we define a new abstraction called model-based views and present the architecture of MauveDB, the system we are building to support such views. Just as traditional database views provide logical data independence, model-based views provide independence from the details of the underlying data generating mechanism and hide the irregularities of the data by using models to present a consistent view to the users. MauveDB supports a declarative language for defining model-based views, allows declarative querying over such views using SQL, and supports several different materialization strategies and techniques to efficiently maintain them in the face of frequent updates. We have implemented a prototype system that currently supports views based on regression and interpolation, using the Apache Derby open source DBMS, and we present results that show the utility and performance benefits that can be obtained by supporting several different types of model-based views in a database system.
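
To make the idea of a regression-based view concrete, the following is a minimal Python sketch, not MauveDB's view-definition language or implementation. It assumes a hypothetical one-dimensional deployment in which each raw reading has a position `x` and a temperature `temp`; the names `Reading`, `fit_line`, and `regression_view` are illustrative only. A straight line is fitted to the irregular raw readings by ordinary least squares, and the "view" exposes model-predicted values on a regular grid instead of the raw tuples.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    x: float     # sensor position (hypothetical attribute)
    temp: float  # raw, possibly noisy measurement


def fit_line(readings):
    """Ordinary least-squares fit of temp = a + b * x."""
    n = len(readings)
    mean_x = sum(r.x for r in readings) / n
    mean_t = sum(r.temp for r in readings) / n
    sxx = sum((r.x - mean_x) ** 2 for r in readings)
    sxt = sum((r.x - mean_x) * (r.temp - mean_t) for r in readings)
    b = sxt / sxx
    a = mean_t - b * mean_x
    return a, b


def regression_view(readings, grid):
    """Return (x, predicted temp) rows on a regular grid of positions."""
    a, b = fit_line(readings)
    return [(x, a + b * x) for x in grid]


# Raw readings arrive at irregular positions; the view is queried on a grid.
raw = [Reading(0.0, 20.0), Reading(1.5, 23.0), Reading(4.0, 28.0)]
view = regression_view(raw, [0, 1, 2, 3, 4])
```

In this sketch the view is recomputed from scratch on every call; the materialization and incremental maintenance strategies mentioned in the abstract are what would avoid refitting the model each time new readings arrive.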
