Continuous Uncertainty in Trio

We present extensions to Trio for incorporating continuous uncertainty into the system. Data items with uncertain possible values drawn from a continuous domain are represented through a generic set of functions. Our approach enables precise and efficient representation of arbitrary probability distribution functions, along with standard distributions such as Gaussians. We also describe how queries are processed efficiently over this representation, without knowledge of specific distributions. For queries that cannot be answered exactly, we can provide approximate answers using sampling or histogram approximations, offering the user a cost-precision trade-off. Our approach exploits Trio’s lineage and confidence features, with smooth integration into the overall data model and system.
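
As an illustration only, the idea of representing a continuous uncertain value through a generic set of functions, with a sampling fallback when no exact answer is available, might be sketched as follows. The names `ContinuousValue`, `gaussian`, and `prob_in_range` are hypothetical and are not taken from Trio. The sketch assumes that a value exposes a sampling function and, optionally, a closed-form cumulative distribution function.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ContinuousValue:
    """Hypothetical uncertain value, described only through generic functions."""
    sample: Callable[[random.Random], float]        # draws one possible value
    cdf: Optional[Callable[[float], float]] = None  # P(X <= x), if known exactly


def gaussian(mu: float, sigma: float) -> ContinuousValue:
    """A standard distribution expressed through the same generic interface."""
    return ContinuousValue(
        sample=lambda rng: rng.gauss(mu, sigma),
        cdf=lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0)))),
    )


def prob_in_range(v: ContinuousValue, lo: float, hi: float,
                  n_samples: int = 10_000, seed: int = 0) -> float:
    """Estimate P(lo < X <= hi) without knowing which distribution v is."""
    if v.cdf is not None:
        # Exact answer: the value supplies a closed-form CDF.
        return v.cdf(hi) - v.cdf(lo)
    # Approximate answer: Monte Carlo estimate; more samples cost more
    # time but give higher precision.
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if lo < v.sample(rng) <= hi)
    return hits / n_samples
```

Here the query logic in `prob_in_range` never inspects the distribution type. It only calls the functions the value supplies, and `n_samples` plays the role of the cost-precision knob.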
