Cloud-Scale Entity Resolution: Current State and Open Challenges

Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volumes have grown so large over the past two decades, parallel techniques are needed to meet the performance and scalability requirements of ER. Parallel ER has reached a relatively mature stage and has found its way into several applications. In this work, we first comprehensively survey the state of the art in parallel ER approaches. From this overview, we then derive classification criteria for parallel ER and use them to classify and compare the surveyed approaches. Finally, we identify open research questions and challenges, and discuss potential solutions and directions for further research in this field.
