Unsupervised String Transformation Learning for Entity Consolidation

Data integration is a long-standing challenge in data management with many applications. A key step in data integration is entity consolidation, which takes a collection of clusters of duplicate records as input and produces a single "golden record" for each cluster, containing the canonical value for each attribute. Truth discovery and data fusion methods, as well as Master Data Management (MDM) systems, can be used for entity consolidation. To achieve better results, however, the variant values in the clusters (i.e., values that are logically the same but formatted differently) need to be consolidated before these methods are applied. For this purpose, we propose a data-driven method that standardizes variant values based on two observations: (1) variant values can usually be transformed to the same representation (e.g., "Mary Lee" and "Lee, Mary"), and (2) the same transformation often appears repeatedly across different clusters (e.g., transposing the first and last name). Our approach first uses an unsupervised method to generate groups of value pairs that can be transformed in the same way. The groups are then presented to a human for verification, and the approved ones are used to standardize the data. On a real-world dataset with 17,497 records, our method achieved 75% recall and 99.5% precision in standardizing variant values by asking a human 100 yes/no questions, outperforming a state-of-the-art data wrangling tool.
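
The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that a transformation is only a reordering of alphanumeric tokens together with a change of separators, and the names `signature` and `group_value_pairs` are hypothetical. Value pairs from different clusters that share a signature fall into the same group, which could then be shown to a human as one yes/no question.

```python
import re
from collections import defaultdict
from itertools import permutations

# Assumption: a "token" is a maximal run of letters or digits.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def signature(src, dst):
    """Describe how dst rearranges the tokens of src.

    Returns (source separators, token order, target separators),
    or None if dst is not a pure reordering of src's tokens.
    """
    src_tokens = TOKEN.findall(src)
    dst_tokens = TOKEN.findall(dst)
    # Repeated tokens make the reordering ambiguous, so skip them here.
    if len(set(src_tokens)) != len(src_tokens):
        return None
    if sorted(src_tokens) != sorted(dst_tokens):
        return None
    # For each target token, record its position in the source.
    order = tuple(src_tokens.index(tok) for tok in dst_tokens)
    # The text between tokens captures the formatting on each side.
    src_seps = tuple(TOKEN.split(src))
    dst_seps = tuple(TOKEN.split(dst))
    return src_seps, order, dst_seps


def group_value_pairs(clusters):
    """Group ordered value pairs from all clusters by shared signature."""
    groups = defaultdict(list)
    for cluster in clusters:
        # Only distinct values within a cluster form candidate pairs.
        for src, dst in permutations(sorted(set(cluster)), 2):
            sig = signature(src, dst)
            if sig is not None:
                groups[sig].append((src, dst))
    return dict(groups)


if __name__ == "__main__":
    clusters = [
        ["Mary Lee", "Lee, Mary"],
        ["John Smith", "Smith, John"],
        ["Ann Poe", "Ann Poe"],
    ]
    groups = group_value_pairs(clusters)
    # Largest groups first: these are the most reusable transformations.
    for sig, pairs in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        print(sig, pairs)
```

Sorting groups by size reflects the second observation in the abstract: a transformation that recurs across many clusters covers more variant values per human answer. A realistic system would need a richer transformation language than token reordering (abbreviations, case changes, substrings), which this sketch deliberately leaves out.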
