Trio: A System for Integrated Management of Data, Accuracy, and Lineage

Trio is a new database system that manages not only data, but also the accuracy and lineage of the data. Approximate (uncertain, probabilistic, incomplete, fuzzy, and imprecise!) databases have been proposed in the past, and the lineage problem has also been studied. The goals of the Trio project are to distill previous work into a simple and usable model, to design a query language as an understandable extension to SQL, and, most importantly, to build a working system: one that augments conventional data management with both accuracy and lineage as an integral part of the data. This paper provides numerous motivating applications for Trio and lays out preliminary plans for the data model, query language, and prototype system.
