Data Integration and Data Exchange: It's Really About Time

With the deluge in the amount and variety of data in the world, data that describes an entity is rarely contained and managed entirely by a single data source. As a consequence, there is often great value in combining data about an entity from multiple sources, and also from versions of data reported by the same source over time. Data integration in which multiple dimensions of time may be expressed explicitly (e.g., as part of the data itself) or implicitly (e.g., through the publication date of a data source) must be performed with great care. This is because each data source contains only partial, time-specific knowledge about an entity, so the sources' collective knowledge about the entity may contain conflicts that need to be resolved. In this paper, we call for a formal framework for data integration and data exchange across time that would facilitate the creation of consistent and integrated longitudinal knowledge about entities. We call such longitudinal knowledge of an entity its when-provenance, which intuitively corresponds to when one knows what one knows about the entity. We believe that the vision and research directions described in this paper will help instigate the research and development of next-generation data integration and data exchange systems, in which both data and time can be reasoned about on an equal footing.
