Data Deduplication: A Review

Gianni Costa · Alfredo Cuzzocrea · Giuseppe Manco · Riccardo Ortale
ICAR-CNR, Via P. Bucci, 41C, 87036 Rende (CS) Italy
e-mail: {costa,cuzzocrea,manco,ortale}@icar.cnr.it

M. Biba and F. Xhafa (Eds.): Learning Structure and Schemas from Documents, SCI 375, pp. 385–412.
springerlink.com © Springer-Verlag Berlin Heidelberg 2011

The widespread use of new techniques and systems for generating, collecting and storing data has made growing volumes of information available. Large quantities of this information are stored as free text. The lack of explicit structure in free text is a major obstacle to categorizing this kind of data for more effective and efficient information retrieval, search and filtering. The abundance of structured data is problematic too. Several available databases contain data of the same type. Unfortunately, they often conform to different schemas, which prevents the unified management of even structured information. The Entity Resolution process plays a fundamental role in information integration and management: it aims to infer a uniform and common structure from various large-scale data collections, with which to suitably organize, match and consolidate the information of the individual repositories into one data set. De-duplication is a key step of the Entity Resolution process, whose goal is to discover duplicates within the integrated data, i.e., different tuples that, as a matter of fact, refer to the same real-world entity. This reduces the redundancy of the integrated data and also enables more effective information handling and knowledge extraction through unified access to reconciled and de-duplicated data. Duplicate detection is an active research area that benefits from contributions from diverse research fields, such as machine learning, data mining and knowledge discovery, databases, and information retrieval and extraction. This chapter presents an overview of research on data de-duplication, with the goal of providing a general understanding and