The return of JedAI: End-to-End Entity Resolution for Structured and Semi-Structured Data

JedAI is an Entity Resolution toolkit that can be used in three ways: (i) as an open-source library that combines state-of-the-art methods into a plethora of end-to-end workflows, (ii) as a user-friendly desktop application with a wizard-like interface that provides complex, out-of-the-box solutions even to lay users, and (iii) as a workbench for comparing the performance of numerous workflows over both structured and semi-structured data. Here, we present its significant upgrade, JedAI 2.0, which enhances the original version in three important respects: (i) time efficiency, as the running time has been drastically reduced with the use of high-performance data structures and multi-core processing, (ii) effectiveness, since we enriched its library with more established methods, a new layer that exploits loose schema binding as well as the automatic, data-driven configuration of individual methods or entire workflows, and (iii) usability, as the GUI now enables users to manually configure any method based on concrete guidelines, to store the matching results into any of the supported data formats and to visually explore both input and output data.

PVLDB Reference Format: George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas, Manolis Koubarakis. The return of JedAI. PVLDB, 11 (12): 1950-1953, 2018. DOI: https://doi.org/10.14778/3229863.3236232
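To make the notion of an end-to-end workflow concrete, the following is a minimal sketch of one possible pipeline: blocking, then matching, then clustering. It is not JedAI's API (JedAI itself is a Java library), and every function name, the choice of token blocking, Jaccard similarity and connected-components clustering, and the 0.5 threshold are assumptions made purely for illustration. Attribute names are ignored on purpose, so profiles with different schemas can still be compared.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy input: note that the attribute names differ across profiles.
EXAMPLE_PROFILES = {
    "a1": {"name": "Apple iPhone 7", "storage": "32GB"},
    "a2": {"title": "iphone 7 apple 32gb black"},
    "b1": {"name": "Samsung Galaxy S8"},
}


def tokens(profile):
    """Schema-agnostic token set: all attribute values, lowercased and split on whitespace."""
    return {tok for value in profile.values() for tok in str(value).lower().split()}


def token_blocking(profiles):
    """Blocking step: one block per token, kept only if it holds at least two profiles."""
    blocks = defaultdict(set)
    for pid, profile in profiles.items():
        for tok in tokens(profile):
            blocks[tok].add(pid)
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Distinct pairs of profiles that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(profiles, pairs, threshold):
    """Matching step: keep candidate pairs whose similarity reaches the threshold."""
    token_sets = {pid: tokens(p) for pid, p in profiles.items()}
    return [(a, b) for a, b in pairs
            if jaccard(token_sets[a], token_sets[b]) >= threshold]


def cluster(profile_ids, matches):
    """Clustering step: connected components of the match graph via union-find."""
    parent = {pid: pid for pid in profile_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for pid in parent:
        groups[find(pid)].add(pid)
    return sorted(sorted(group) for group in groups.values())


def resolve(profiles, threshold=0.5):
    """Run the three steps in sequence and return the resulting entity clusters."""
    blocks = token_blocking(profiles)
    pairs = candidate_pairs(blocks)
    matches = match(profiles, pairs, threshold)
    return cluster(profiles.keys(), matches)
```

Calling `resolve(EXAMPLE_PROFILES)` groups the two iPhone profiles together and leaves the Samsung profile on its own. Raising the threshold makes matching stricter, which hints at why choosing such parameters in a data-driven way matters.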
