Experience: Type alignment on DBpedia and Freebase

Linked Open Data continues to grow in both the volume and the variety of published data. Because of this variety, published datasets contain instances of many different types (e.g. Person). Type alignment is the problem of automatically matching types, possibly in a many-to-many fashion, between two such datasets. It is an important preprocessing step for instance matching, the task of identifying pairs of instances that refer to the same underlying entity. When type alignment is performed a priori, only instances conforming to aligned types are processed together, which leads to significant savings. This article describes a type alignment experience with two large-scale, cross-domain RDF knowledge graphs, DBpedia and Freebase, that contain hundreds, or even thousands, of unique types. Specifically, we present a MapReduce-based type alignment algorithm and show that there are at least three reasonable ways of evaluating type alignment within the larger context of instance matching. We comment on the consistency of the results across these evaluations and offer some general observations for researchers evaluating similar algorithms on cross-domain graphs.
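To illustrate why aligning types first can reduce the work done by instance matching, the following is a minimal sketch and not the MapReduce-based algorithm presented in the article. It assumes each instance carries exactly one type and that a many-to-many alignment is already available as a set of type pairs. All identifiers, type names, and the helper functions `group_by_type` and `candidate_pairs` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_by_type(instances):
    """Index instance identifiers by their (single, assumed) type."""
    groups = defaultdict(list)
    for instance_id, type_name in instances:
        groups[type_name].append(instance_id)
    return groups


def candidate_pairs(instances_a, instances_b, aligned_types):
    """Enumerate only the instance pairs whose types are aligned.

    aligned_types is a set of (type_a, type_b) pairs, so one type may be
    aligned with several types in the other dataset (many-to-many).
    """
    groups_a = group_by_type(instances_a)
    groups_b = group_by_type(instances_b)
    pairs = []
    for type_a, type_b in sorted(aligned_types):
        for id_a in groups_a.get(type_a, []):
            for id_b in groups_b.get(type_b, []):
                pairs.append((id_a, id_b))
    return pairs


# Toy, made-up data: (instance identifier, type) tuples.
dataset_a = [
    ("a:Ada_Lovelace", "a:Person"),
    ("a:Alan_Turing", "a:Person"),
    ("a:Austin", "a:City"),
]
dataset_b = [
    ("b:ada", "b:person"),
    ("b:austin", "b:citytown"),
    ("b:texas", "b:state"),
]
# Assumed output of some type alignment step.
alignment = {("a:Person", "b:person"), ("a:City", "b:citytown")}

all_pair_count = len(dataset_a) * len(dataset_b)  # exhaustive comparison
typed_pairs = candidate_pairs(dataset_a, dataset_b, alignment)
print(all_pair_count, len(typed_pairs))
```

In this toy setting, the exhaustive approach would consider 9 pairs, while restricting to aligned types leaves 3. If an instance could carry several types, the same pair might be emitted more than once and would need deduplication.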
