SALT: Scalable Automated Linking Technology for Data-Intensive Computing

One of the most complex tasks in a data processing environment is record linkage: the data integration process of accurately matching or clustering records or documents from multiple data sources that refer to the same entity, such as a person or business. The massive amounts of data now collected by many organizations have led to what is called the "Big Data" problem, which limits organizations' ability to process and use their data effectively and makes record linkage even more challenging [3, 13]. New high-performance, data-intensive computing architectures supporting scalable parallel processing, such as Hadoop MapReduce and HPCC, allow government, commercial, and research organizations to process massive amounts of data and solve complex data processing problems, including record linkage. A fundamental challenge of data-intensive computing is developing new algorithms that can scale to search and process big data [17]. SALT (Scalable Automated Linking Technology) is a new tool that automatically generates code in the ECL language for the open-source HPCC scalable data-intensive computing platform from a simple specification, addressing the most common data integration tasks, including data profiling, data cleansing, data ingest, and record linkage.
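The core record-linkage idea described above, scoring pairs of records by weighted field similarity and accepting pairs above a threshold, can be illustrated with a minimal sketch. This is not SALT's generated ECL or its actual matching algorithm; it is an illustrative Python example, and the field names, weights, and threshold are hypothetical choices made for the demonstration.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Normalized similarity (0.0-1.0) between two field values,
    ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(rec1: dict, rec2: dict, weights: dict) -> float:
    """Weighted average of per-field similarities.
    `weights` maps field name -> importance (hypothetical values)."""
    total = sum(weights.values())
    return sum(w * field_similarity(rec1[f], rec2[f])
               for f, w in weights.items()) / total


def link_records(recs_a, recs_b, weights, threshold=0.85):
    """Return (i, j) index pairs whose match score meets the threshold.
    A real system would block/partition records first to avoid the
    full cross-product comparison shown here."""
    return [(i, j)
            for i, r1 in enumerate(recs_a)
            for j, r2 in enumerate(recs_b)
            if match_score(r1, r2, weights) >= threshold]
```

For example, linking `{"name": "John Smith", "city": "Boca Raton"}` against a list containing `{"name": "Jon Smith", "city": "Boca Raton"}` and `{"name": "Mary Jones", "city": "Miami"}` matches only the first candidate. Production systems replace the naive all-pairs loop with blocking and use more discriminating similarity measures and learned weights, as surveyed in the references below.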

[1] Lifang Gu, et al. Record Linkage: Current Practice and Future Directions, 2003.

[2] William E. Winkler, et al. Data quality and record linkage techniques, 2007.

[3] Borko Furht, et al. Handbook of Cloud Computing, 2010.

[4] H. B. Newcombe, et al. Automatic linkage of vital records, 1959, Science.

[5] William E. Winkler, et al. The State of Record Linkage and Current Research Problems, 1999.

[6] Anthony M. Middleton. Data-Intensive Technologies for Cloud Computing, 2010, Handbook of Cloud Computing.

[7] Howard B. Newcombe, et al. Record linkage: making maximum use of the discriminating power of identifying information, 1962, CACM.

[8] Ivan P. Fellegi, et al. A Theory for Record Linkage, 1969.

[9] Stephen E. Robertson, et al. Understanding inverse document frequency: on theoretical arguments for IDF, 2004, J. Documentation.

[10] William W. Cohen, et al. Learning to match and cluster large high-dimensional data sets for data integration, 2002, KDD.

[11] Pradeep Ravikumar, et al. A Comparison of String Distance Metrics for Name-Matching Tasks, 2003, IIWeb.

[12] Karl Branting. A comparative evaluation of name-matching algorithms, 2003, ICAIL.
[13] Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval, 1972, J. Documentation.

[14] Eric R. Ziegel, et al. Business survey methods, 1995.

[15] Divesh Srivastava, et al. Flexible String Matching Against Large Databases in Practice, 2004, VLDB.

[16] Luis Gravano, et al. Text joins in an RDBMS for web data integration, 2003, WWW '03.

[17] Raymond J. Mooney, et al. Adaptive duplicate detection using learnable string similarity measures, 2003, KDD '03.

[18] Vassilios S. Verykios, et al. Record Matching: Past, Present and Future, 2001.

[19] Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval, 1972.

[20] William W. Cohen. Data integration using similarity joins and a word-based information representation language, 2000, TOIS.

[21] William W. Cohen, et al. Learning to Match and Cluster Entity Names, 2001.

[22] Peter Christen, et al. Automatic record linkage using seeded nearest neighbour and support vector machine classification, 2008, KDD.