Entity deduplication in big data graphs for scholarly communication

Several online services offer functionalities to access information from “big research graphs” (e.g. Google Scholar, OpenAIRE, Microsoft Academic Graph), which correlate scholarly/scientific communication entities such as publications, authors, datasets, organizations, projects and funders. Depending on the target users, access ranges from searching and browsing content to consuming statistics for monitoring and providing feedback. Such graphs are populated over time as aggregations of multiple sources and therefore suffer from major entity-duplication problems. Although deduplication of graphs is a known and current problem, existing solutions are dedicated to specific scenarios, operate on flat collections or address local, topology-driven challenges, and therefore cannot be re-used in other contexts.

This work presents GDup, an integrated, scalable, general-purpose system that can be customized to address deduplication over arbitrarily large information graphs. The paper presents its high-level architecture and its implementation as a service used within the OpenAIRE infrastructure system, and reports figures from real-case experiments.

GDup provides the functionalities required to deliver a fully-fledged entity deduplication workflow over a generic input graph. The system offers out-of-the-box Ground Truth management, acquisition of feedback from data curators, and algorithms for identifying and merging duplicates, to obtain a disambiguated output graph.

To our knowledge, GDup is the only system in the literature that offers an integrated and general-purpose solution for the deduplication of graphs while targeting big data scalability issues. GDup is today one of the key modules of the OpenAIRE infrastructure production system, which monitors Open Science trends on behalf of the European Commission, national funders and institutions.
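To make the "identify and merge duplicates" step more concrete, the following is a minimal, single-machine sketch of a generic deduplication pass. It is not GDup's implementation and does not use its API. The blocking key (first four alphanumeric characters of the title), the title-similarity threshold of 0.9, and the names `blocking_key`, `is_duplicate` and `deduplicate` are all assumptions made for illustration. Candidate pairs are restricted to records sharing a blocking key, compared pairwise, and matching records are grouped with a union-find structure.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first 4 alphanumeric characters of the
    # lowercased title. Only records sharing a key are compared.
    title = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return title[:4]


def is_duplicate(a, b, threshold=0.9):
    # Hypothetical similarity rule: title string similarity above a threshold.
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= threshold


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def deduplicate(records):
    # 1. Blocking: group records by key to avoid comparing all pairs.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    # 2. Pairwise matching inside each block; matches are unioned.
    parent = {rec["id"]: rec["id"] for rec in records}
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if is_duplicate(a, b):
                ra, rb = find(parent, a["id"]), find(parent, b["id"])
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)

    # 3. Merging: each connected group of identifiers is one entity.
    groups = defaultdict(list)
    for rec in records:
        groups[find(parent, rec["id"])].append(rec["id"])
    return sorted(sorted(group) for group in groups.values())
```

Given two records whose titles differ only in capitalization and a trailing period, plus one unrelated record, `deduplicate` returns the two near-identical records in one group and the third on its own. A production system over a big graph would distribute the blocking and matching steps and would use richer, entity-specific comparison rules. This sketch only shows the overall shape of such a workflow.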
