Semantic Link Discovery over Relational Data

To make semantic search a reality, we need to be able to efficiently publish large data sets that contain rich semantic structure. Tools exist for translating relational and semi-structured data into RDF, but these translation tools are not designed to add the kind of semantics needed to achieve the goals of the Semantic Web and of semantic search over the Web. In this chapter, we present LinQuer, a tool for creating semantic links within a data source and between data sources. We focus on link discovery over structured (relational) data for two reasons: many Semantic Web sources are the result of publishing relational data as RDF, and relational engines provide the scalability and flexibility needed for large-scale link discovery. The LinQuer framework is based on a user's declarative specification of linkage requirements. We present algorithms for translating these requirements into queries that can run over relational data sources, potentially using semantic information (such as a class hierarchy or a more general ontology) to enhance the recall of link discovery. We show that this framework is flexible enough to link real data, including dirty data (which is commonly found on the Web) and data with a variety of semantic connections.
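The general idea of turning a declarative linkage requirement into a relational query can be illustrated with a small sketch. This is not LinQuer's specification language or its translation algorithms, which the abstract does not detail. The dictionary-based spec format, the table and column names (`trials`, `articles`, `synonyms`), and the helper functions are all hypothetical, chosen only to show how a synonym table might raise recall over a plain normalized equality join.

```python
import sqlite3


def build_link_query(spec):
    """Translate a hypothetical link spec (a dict) into a SQL string.

    Identifiers are interpolated directly, so the spec must come from a
    trusted source; this is an illustrative sketch, not production code.
    """
    src, tgt = spec["source"], spec["target"]
    # Base case: equality after trimming whitespace and lowercasing,
    # which tolerates a little dirtiness in the values.
    exact = (
        f"SELECT s.{src['key']} AS src_id, t.{tgt['key']} AS tgt_id "
        f"FROM {src['table']} s JOIN {tgt['table']} t "
        f"ON LOWER(TRIM(s.{src['column']})) = LOWER(TRIM(t.{tgt['column']}))"
    )
    syn = spec.get("synonyms")
    if syn is None:
        return exact
    # Optional semantic expansion: also link values that are connected
    # through a (hypothetical) synonym table.
    via_syn = (
        f"SELECT s.{src['key']}, t.{tgt['key']} "
        f"FROM {src['table']} s "
        f"JOIN {syn['table']} y "
        f"ON LOWER(TRIM(s.{src['column']})) = LOWER(y.{syn['term']}) "
        f"JOIN {tgt['table']} t "
        f"ON LOWER(TRIM(t.{tgt['column']})) = LOWER(y.{syn['synonym']})"
    )
    return exact + " UNION " + via_syn


def discover_links(conn, spec):
    """Run the generated query and return sorted (src_id, tgt_id) pairs."""
    return sorted(conn.execute(build_link_query(spec)).fetchall())


def make_demo_db():
    """Build a tiny in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE trials (id INTEGER, condition TEXT);
        CREATE TABLE articles (id INTEGER, topic TEXT);
        CREATE TABLE synonyms (term TEXT, synonym TEXT);
        INSERT INTO trials VALUES (1, '  Heart Attack '), (2, 'Asthma');
        INSERT INTO articles VALUES
            (10, 'Myocardial Infarction'), (11, 'Heart attack'), (12, 'asthma');
        INSERT INTO synonyms VALUES ('heart attack', 'myocardial infarction');
        """
    )
    return conn


# Hypothetical declarative requirement: link trials to articles on condition.
BASE_SPEC = {
    "source": {"table": "trials", "key": "id", "column": "condition"},
    "target": {"table": "articles", "key": "id", "column": "topic"},
}
# Same requirement, extended with a synonym table for semantic expansion.
SEMANTIC_SPEC = dict(
    BASE_SPEC,
    synonyms={"table": "synonyms", "term": "term", "synonym": "synonym"},
)
```

On the demo rows, `discover_links(make_demo_db(), BASE_SPEC)` finds only the normalized string matches, while `SEMANTIC_SPEC` additionally links trial 1 to the article on myocardial infarction. Keeping the requirement as data and generating SQL from it lets the database engine do the join work, which is the scalability argument made above.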
