Accuracy vs. Speed: Scalable Entity Coreference on the Semantic Web with On-the-Fly Pruning

One challenge for the Semantic Web is to scalably establish high-quality owl:sameAs links between coreferent ontology instances in different data sources. Traditional approaches that exhaustively compare every pair of instances do not scale well to large datasets. In this paper, we propose a pruning-based algorithm for reducing the complexity of entity coreference. First, we discard candidate pairs of instances that are not sufficiently similar to the same pool of other instances. A sigmoid-function-based thresholding method is proposed to automatically adjust the threshold for such commonality on the fly. In our prior work, each instance is associated with a context graph consisting of neighboring RDF nodes. In this paper, we speed up the comparison for a single pair of instances by pruning insignificant context in the graph; this is accomplished by evaluating each piece of context's potential contribution to the final similarity measure. We evaluate our system on three Semantic Web instance categories. We verify the effectiveness of our thresholding and context pruning methods by comparing against nine state-of-the-art systems. We show that our algorithm frequently outperforms those systems, with a runtime speedup factor of 18 to 24, while maintaining competitive F1-scores. For datasets of up to 1 million instances, this translates to a runtime improvement of as much as 370 hours.
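As a rough illustration of the candidate-pruning idea, the following minimal sketch filters candidate pairs against a threshold that a sigmoid adjusts as the candidate pool grows. The abstract does not give the actual formula, so the function names, the parameters (`low`, `high`, `midpoint`, `steepness`), and the choice of pool size as the sigmoid's input are all assumptions made for illustration, not the paper's method.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_threshold(num_candidates: int,
                       low: float = 0.2,
                       high: float = 0.8,
                       midpoint: float = 50.0,
                       steepness: float = 0.1) -> float:
    """Hypothetical threshold that rises smoothly from `low` toward `high`
    as the number of candidate pairs grows (all parameters are assumed)."""
    return low + (high - low) * sigmoid(steepness * (num_candidates - midpoint))


def prune_candidates(scores: dict) -> dict:
    """Keep only the pairs whose commonality score meets the threshold
    computed on the fly from the current pool size."""
    threshold = adaptive_threshold(len(scores))
    return {pair: s for pair, s in scores.items() if s >= threshold}


# Usage: two candidate pairs with made-up commonality scores.
scores = {("inst_a", "inst_b"): 0.9, ("inst_a", "inst_c"): 0.1}
kept = prune_candidates(scores)
```

With a small pool the threshold stays near `low`, so only clearly dissimilar pairs are discarded; a larger pool pushes the threshold toward `high` and prunes more aggressively.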