GDup: De-Duplication of Scholarly Communication Big Graphs

Several online services now provide access to big scholarly communication graphs, which interlink entities such as publications, authors, datasets, and organizations. Such graphs are often populated over time by aggregating multiple sources and therefore suffer from entity duplication. Although graph deduplication is a well-known and still-open problem, existing solutions tend to be special-purpose and address only a few of the underlying challenges. In this paper, we propose GDup, an integrated, scalable, general-purpose system for entity deduplication over big information graphs. GDup gives practitioners the functionality needed to build a fully-fledged entity deduplication workflow over a generic input graph, including Ground Truth support, end-user feedback, and strategies for identifying and merging duplicates to obtain a disambiguated output graph. GDup is today one of the core components of the production system of the OpenAIRE infrastructure, which monitors Open Science trends on behalf of the European Commission.
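To make the "identify and merge duplicates" step concrete, the following is a minimal sketch of a generic deduplication pass: group records into candidate blocks, compare pairs within each block, and merge matching records into groups. It is not GDup's implementation. The record layout (`id`, `title`), the blocking rule, the similarity threshold, and all function names are assumptions chosen for illustration only.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking rule: first four characters of the normalized title.
    return record["title"].lower().strip()[:4]


def similar(a, b, threshold=0.9):
    # Hypothetical match rule: title similarity at or above an assumed threshold.
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= threshold


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def deduplicate(records):
    # 1. Candidate identification: only records sharing a block are compared.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    # 2. Pairwise matching inside each block; matches are unioned together.
    parent = {rec["id"]: rec["id"] for rec in records}
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if similar(a, b):
                ra, rb = find(parent, a["id"]), find(parent, b["id"])
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)

    # 3. Merging: each connected group of matches becomes one entity.
    groups = defaultdict(list)
    for rec in records:
        groups[find(parent, rec["id"])].append(rec["id"])
    return sorted(sorted(g) for g in groups.values())
```

Blocking keeps the number of pairwise comparisons manageable on large inputs, at the cost of missing duplicates that fall into different blocks; a real system would tune both the blocking rule and the match function per entity type.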
