Pay-as-you-go data integration for linked data: opportunities, challenges and architectures

Linked Data (LD) provides principles for publishing data that underpin the development of an emerging web of data. LD follows the web in providing low barriers to entry: publishers can make their data available using a small set of standard technologies, and consumers can search for and browse published data using generic tools. As on the web, consumers frequently consume data in broadly the form in which it was published. This will be satisfactory in some cases, but the diversity of publishers means that the data required to support a task may be stored in many different sources and described in many different ways. Thus, although RDF provides a syntactically homogeneous language for describing data, sources typically manifest a wide range of heterogeneities in how data on a concept is represented. This paper makes the case that many aspects of both publication and consumption of LD stand to benefit from a pay-as-you-go approach to data integration. Specifically, the paper: (i) identifies a collection of opportunities for applying pay-as-you-go techniques to LD; (ii) describes some preliminary experiences applying a pay-as-you-go data integration system to LD; and (iii) presents some open issues that need to be addressed to enable the full benefits of pay-as-you-go integration to be realised.
