The Case for Holistic Data Integration

Current data integration approaches are mostly limited to a few data sources, partly because they rely on binary matching between pairs of sources. We therefore advocate the development of more holistic, clustering-based data integration approaches that scale to many data sources. We outline different use cases and provide an overview of initial approaches for holistic schema/ontology integration and entity clustering. The discussion also covers open data repositories and so-called knowledge graphs.
