Paper Information - Performance Comparison of Three Spark-Based Implementations of Parallel Entity Resolution

Performance Comparison of Three Spark-Based Implementations of Parallel Entity Resolution

During the last decade, several big data processing frameworks have emerged, enabling users to analyze large-scale data with ease. These frameworks make it easier to handle distributed programming, failures, and data partitioning. Entity resolution is a typical application that requires big data processing frameworks, since its time complexity grows quadratically with the size of the input data. In recent years, Apache Spark has become popular as a big data framework providing a flexible programming model that supports in-memory computation. Spark offers three APIs: RDDs, which give users core low-level data access, and the high-level DataFrame and Dataset APIs, which are part of the Spark SQL library and undergo query optimization. Because of their different features, the choice of API can be expected to influence the resulting performance of applications. However, few studies offer experimental measurements that characterize the effect of these distinctions. In this paper we evaluate the performance impact of such choices for the specific application of parallel entity resolution under two different scenarios, with the goal of offering practical guidelines for developers.

Gunter Saake | Gabriel Campero Durand | Eike Schallehn | Xiao Chen | Kirity Rapuru
