Executing Entity Matching End to End: A Case Study

Entity matching (EM) identifies data instances that refer to the same real-world entity. Numerous EM works cover a wide spectrum, from developing new EM algorithms, to scaling them, to building EM systems. But there has been very little, if any, published work on how EM is carried out in practice, end to end. In this paper we describe in detail a case study of applying EM to a particular domain end to end (i.e., going from the raw data all the way to the matches). Specifically, we describe a real-world application of EM in the science policy research community. We describe how our team (the EM team) interacts with the science policy team to carry out the EM process, using PyMatcher, a state-of-the-art EM system developed in the Magellan project at UW-Madison. We highlight the communication between the two teams and the zig-zag nature of the EM process. We identify a set of challenges that we believe arise in many real-world EM projects but that current EM systems have either ignored or are not even aware of. Finally, we provide all data underlying this case study, including labeled tuple pairs and documentation supplied by the science policy team, to serve as a challenge problem for EM researchers.
