Domain- and Structure-Agnostic End-to-End Entity Resolution with JedAI

We present JedAI, a new open-source toolkit for end-to-end Entity Resolution. JedAI is domain-agnostic in the sense that it does not depend on background expert knowledge, applying seamlessly to data of any domain with minimal human intervention. JedAI is also structure-agnostic, as it can process any type of data, ranging from structured (relational) to semi-structured (RDF) and unstructured (free-text) entity descriptions. JedAI consists of two parts: (i) JedAI-core is a library of numerous state-of-the-art methods that can be mixed and matched to form thousands of end-to-end workflows, making it easy to benchmark their relative performance. (ii) JedAI-gui is a user-friendly desktop application that facilitates the composition of complex workflows via a wizard-like interface. It is suitable for both lay and power users, offering concrete guidelines and automatic configuration, as well as manual configuration options, visual exploration, and detailed statistics on each method's performance. In this paper, we also delve into the new features of JedAI's latest version (2.1) and demonstrate its performance experimentally.
