SALT: Scalable Automated Linking Technology for Data-Intensive Computing

One of the most complex tasks in a data processing environment is record linkage: the data integration process of accurately matching or clustering records or documents from multiple data sources that refer to the same entity, such as a person or business. The massive amounts of data now collected by many organizations have led to what is called the "Big Data" problem, which limits organizations' ability to process and use their data effectively and makes record linkage even more challenging [3, 13]. New high-performance, data-intensive computing architectures supporting scalable parallel processing, such as Hadoop MapReduce and HPCC, allow government, commercial, and research organizations to process massive amounts of data and solve complex data processing problems, including record linkage. A fundamental challenge of data-intensive computing is developing new algorithms that can scale to search and process big data [17]. SALT (Scalable Automated Linking Technology) is a new tool that automatically generates code in the ECL language for the open-source HPCC scalable data-intensive computing platform from a simple specification, addressing the most common data integration tasks, including data profiling, data cleansing, data ingest, and record linkage.
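The core record-linkage idea described above, scoring pairs of records by weighted field similarity and accepting pairs above a threshold, can be illustrated with a minimal sketch. This is not SALT's generated ECL or its actual matching algorithm; it is an illustrative Python example, and the field names, weights, and threshold are hypothetical choices made for the demonstration.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Normalized similarity (0.0-1.0) between two field values,
    ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(rec1: dict, rec2: dict, weights: dict) -> float:
    """Weighted average of per-field similarities.
    `weights` maps field name -> importance (hypothetical values)."""
    total = sum(weights.values())
    return sum(w * field_similarity(rec1[f], rec2[f])
               for f, w in weights.items()) / total


def link_records(recs_a, recs_b, weights, threshold=0.85):
    """Return (i, j) index pairs whose match score meets the threshold.
    A real system would block/partition records first to avoid the
    full cross-product comparison shown here."""
    return [(i, j)
            for i, r1 in enumerate(recs_a)
            for j, r2 in enumerate(recs_b)
            if match_score(r1, r2, weights) >= threshold]
```

For example, linking `{"name": "John Smith", "city": "Boca Raton"}` against a list containing `{"name": "Jon Smith", "city": "Boca Raton"}` and `{"name": "Mary Jones", "city": "Miami"}` matches only the first candidate. Production systems replace the naive all-pairs loop with blocking and use more discriminating similarity measures and learned weights, as surveyed in the references below.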

[1] Lifang Gu, et al. Record Linkage: Current Practice and Future Directions, 2003.

[2] William E. Winkler, et al. Data quality and record linkage techniques, 2007.

[3] Borko Furht, et al. Handbook of Cloud Computing, 2010.

[4] H. B. Newcombe, et al. Automatic linkage of vital records, 1959, Science.

[5] William E. Winkler, et al. The State of Record Linkage and Current Research Problems, 1999.

[6] Anthony M. Middleton. Data-Intensive Technologies for Cloud Computing, 2010, Handbook of Cloud Computing.

[7] Howard B. Newcombe, et al. Record linkage: making maximum use of the discriminating power of identifying information, 1962, CACM.

[8] Ivan P. Fellegi, et al. A Theory for Record Linkage, 1969.

[9] Stephen E. Robertson, et al. Understanding inverse document frequency: on theoretical arguments for IDF, 2004, J. Documentation.

[10] William W. Cohen, et al. Learning to match and cluster large high-dimensional data sets for data integration, 2002, KDD.

[11] Pradeep Ravikumar, et al. A Comparison of String Distance Metrics for Name-Matching Tasks, 2003, IIWeb.

[12] Karl Branting. A comparative evaluation of name-matching algorithms, 2003, ICAIL.
[13] Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval, 1972, J. Documentation.

[14] Eric R. Ziegel, et al. Business survey methods, 1995.

[15] Divesh Srivastava, et al. Flexible String Matching Against Large Databases in Practice, 2004, VLDB.

[16] Luis Gravano, et al. Text joins in an RDBMS for web data integration, 2003, WWW '03.

[17] Raymond J. Mooney, et al. Adaptive duplicate detection using learnable string similarity measures, 2003, KDD '03.

[18] Vassilios S. Verykios, et al. Record Matching: Past, Present and Future, 2001.

[19] Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval, 1972.

[20] William W. Cohen. Data integration using similarity joins and a word-based information representation language, 2000, TOIS.

[21] William W. Cohen, et al. Learning to Match and Cluster Entity Names, 2001.

[22] Peter Christen, et al. Automatic record linkage using seeded nearest neighbour and support vector machine classification, 2008, KDD.