Large-Scale Data Pollution with Apache Spark

Because of the increasing volume of autonomously collected data objects, duplicate detection is an important challenge in today's data management. Evaluating the efficiency of duplicate detection algorithms on big data requires large test data sets. Existing test data generation tools, however, are either unable to produce large test data sets or are domain-dependent, which limits their usefulness to a few cases. In this paper, we describe a new framework that pollutes a clean, homogeneous, and large data set from an arbitrary domain with duplicates, errors, and inhomogeneities. As a proof of concept, we implemented a prototype built on the cluster computing framework Apache Spark and evaluated its performance in several experiments.
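
The pollution idea sketched in the abstract — take a clean data set, sample a fraction of its records, corrupt some of them, and append the result as duplicates — maps naturally onto Spark transformations. The following Scala sketch illustrates that idea only; the input and output paths, the duplicate and error rates, and the injectTypo helper are illustrative assumptions, not the paper's actual implementation.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

// Hypothetical sketch of data pollution with Spark: duplicate a fraction of the
// clean records and inject simple character-level errors into some duplicates.
// Serializable so the helper method can be referenced from Spark closures.
object DataPollutionSketch extends Serializable {

  // Swap two adjacent characters to simulate a typo (no-op for very short values).
  def injectTypo(value: String, rnd: Random): String = {
    if (value.length < 2) value
    else {
      val i = rnd.nextInt(value.length - 1)
      val chars = value.toCharArray
      val tmp = chars(i); chars(i) = chars(i + 1); chars(i + 1) = tmp
      new String(chars)
    }
  }

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val spark = SparkSession.builder().appName("DataPollutionSketch").getOrCreate()
    val sc = spark.sparkContext

    // Clean, homogeneous input: one record per line (paths are placeholders).
    val clean = sc.textFile("hdfs:///input/clean_records.csv")

    val duplicateFraction = 0.2  // fraction of records to duplicate (assumed)
    val errorProbability  = 0.5  // probability that a duplicate is corrupted (assumed)

    // Sample records, corrupt some of them, and append them to the clean data.
    val duplicates = clean
      .sample(withReplacement = false, fraction = duplicateFraction, seed = 42L)
      .mapPartitionsWithIndex { (idx, it) =>
        val rnd = new Random(idx)
        it.map(rec => if (rnd.nextDouble() < errorProbability) injectTypo(rec, rnd) else rec)
      }

    clean.union(duplicates).saveAsTextFile("hdfs:///output/polluted_records")

    spark.stop()
  }
}
```

Because sampling and per-partition corruption are narrow transformations, such a pollution pipeline distributes across a cluster without shuffles, which is what makes a Spark-based approach attractive for generating very large test data sets.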
