Open Data Integration

Open data plays a major role in supporting both governmental and organizational transparency. Many organizations are adopting Open Data Principles, promising to make their open data complete, primary, and timely. These properties make this data tremendously valuable to data scientists. However, scientists generally do not have a priori knowledge about what data is available (its schema or content). Nevertheless, they want to be able to use open data and integrate it with other public or private data they are studying. Traditionally, data integration is done using a framework called query discovery, where the main task is to discover a query (or transformation) that translates data from one form into another. The goal is to find the right operators to join, nest, group, link, and twist data into a desired form. We introduce a new paradigm for thinking about integration in which the focus is on data discovery, specifically highly efficient, internet-scale discovery that is driven by data analysis needs. We describe a research agenda and recent progress in developing scalable data-analysis or query-aware data discovery algorithms that provide high recall and accuracy over massive data repositories.

PVLDB Reference Format: Renée J. Miller. Open Data Integration. PVLDB, 11(12): 2130-2139, 2018. DOI: https://doi.org/10.14778/3229863.3240491
