Contraption of Suffix Array Blocking for Efficacious Record Linkage and De-duplication

Record linkage is the process of bringing together, for a common purpose, information from multiple computerized files. The basic methods compare name and address information across pairs of files to determine which pairs of records are associated with the same entity. An entity might be a business, a person, or some other type of unit that is listed. De-duplication is the task of identifying records within a single repository that represent the same object or entity. The same data may be represented differently in different databases, which makes matching difficult. Diverse indexing techniques have been developed in recent years for record linkage and de-duplication. They aim to reduce the number of record pairs that must be compared in the similarity matching process while maintaining high matching quality. This paper presents a suffix array blocking mechanism for effective record linkage and de-duplication based on different similarity measures.

General Terms: Indexing methods, Record classification
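To illustrate how suffix array blocking reduces the number of record pairs to compare, the following is a minimal sketch and not the paper's implementation. It assumes each record has already been reduced to a single blocking key string, and the names `suffixes`, `suffix_array_blocks`, `candidate_pairs`, `min_len`, and `max_block` are hypothetical, chosen only for this example. Every suffix of a key that has at least `min_len` characters is inserted into an inverted index, suffixes shared by more than `max_block` records are discarded as too common, and only records that share a surviving suffix become candidate pairs.

```python
from collections import defaultdict
from itertools import combinations


def suffixes(key, min_len):
    """Return every suffix of `key` that has at least `min_len` characters."""
    # Keys shorter than min_len yield no suffixes (an assumption of this sketch).
    return [key[i:] for i in range(len(key) - min_len + 1)]


def suffix_array_blocks(records, min_len=4, max_block=10):
    """Build an inverted index from suffix to record ids.

    `records` maps a record id to its blocking key string (hypothetical input
    format). Blocks with fewer than 2 or more than `max_block` records are
    dropped: the former yield no pairs, the latter are too unselective.
    """
    index = defaultdict(set)
    for rec_id, key in records.items():
        for suffix in suffixes(key.lower(), min_len):
            index[suffix].add(rec_id)
    return {s: ids for s, ids in index.items() if 2 <= len(ids) <= max_block}


def candidate_pairs(blocks):
    """Collect the distinct record pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    demo = {1: "catherine", 2: "katherine", 3: "smith", 4: "smyth"}
    print(candidate_pairs(suffix_array_blocks(demo, min_len=4)))
```

In this sketch, "catherine" and "katherine" share suffixes such as "atherine" and so form a candidate pair despite the differing first letter, whereas "smith" and "smyth" share no suffix of four or more characters and are never compared. The candidate pairs would then be passed to whichever similarity measure is chosen for the detailed comparison step.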
