The task of identifying, in a data warehouse, records that refer to the same real-world entity despite misspellings, typos, different writing styles, or even different schema representations or data types is called record deduplication. Existing research has proposed a genetic programming (GP) approach to record deduplication. That approach combines several different pieces of evidence extracted from the data content to produce a deduplication function that can identify whether two or more entries in a repository are duplicates. Because record deduplication is a time-consuming task even for small repositories, its aim is to find a proper combination of the best pieces of evidence, yielding a deduplication function that maximizes performance while using only a small representative portion of the data for training. However, the GP approach offers limited optimization of the process. Our research addresses these issues with a novel technique, a modified bat algorithm for record deduplication. The motivation is to create a flexible and effective method that employs data mining algorithms. The method shares many similarities with evolutionary computation techniques such as genetic programming: it is initialized with a population of random solutions and searches for optima by updating the bats over successive generations. Unlike GP, however, the modified bat algorithm has no evolution operators such as crossover and mutation. We also compare the proposed algorithm experimentally with other existing algorithms, including GP.
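As a rough illustration of the idea described above, the following Python sketch searches for a weighting of per-field similarity evidence with a bat-style population search. It follows the standard bat algorithm rather than the modified variant proposed here, whose details are not given in this abstract. The weighted-sum deduplication function, the F1 fitness, the 0.1 local-walk step, all parameter values, and the toy records are assumptions made only for illustration.

```python
import math
import random
from difflib import SequenceMatcher

def evidence(a, b):
    """Per-field string similarity in [0, 1] for two records (tuples of strings)."""
    return [SequenceMatcher(None, x.lower(), y.lower()).ratio() for x, y in zip(a, b)]

def is_duplicate(weights, ev, threshold=0.5):
    """Hypothetical deduplication function: normalized weighted sum of the evidence."""
    total = sum(weights)
    if total == 0:
        return False
    return sum(w * e for w, e in zip(weights, ev)) / total >= threshold

def f1(weights, pairs):
    """Fitness: F1 score of the weighted function on labeled (evidence, label) pairs."""
    tp = fp = fn = 0
    for ev, label in pairs:
        pred = is_duplicate(weights, ev)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif label and not pred:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bat_search(pairs, dim, n_bats=15, n_iter=60, f_min=0.0, f_max=2.0,
               alpha=0.9, gamma=0.9, seed=0):
    """Standard bat algorithm (maximizing); no crossover or mutation is used."""
    rng = random.Random(seed)
    clip = lambda v: min(1.0, max(0.0, v))
    # Population of random solutions: each bat is a weight vector in [0, 1]^dim.
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    fit = [f1(p, pairs) for p in pos]
    loud = [1.0] * n_bats                       # loudness A_i
    r0 = [rng.random() for _ in range(n_bats)]  # initial pulse rates
    rate = list(r0)
    best_i = max(range(n_bats), key=fit.__getitem__)
    best, best_fit = pos[best_i][:], fit[best_i]
    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Frequency-tuned move relative to the current best solution.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - b) * freq for v, x, b in zip(vel[i], pos[i], best)]
            cand = [clip(x + v) for x, v in zip(pos[i], vel[i])]
            if rng.random() > rate[i]:
                # Local random walk around the best bat (0.1 step is an assumption).
                cand = [clip(b + 0.1 * rng.uniform(-1, 1) * mean_loud) for b in best]
            cand_fit = f1(cand, pairs)
            if cand_fit >= fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha                                  # get quieter
                rate[i] = r0[i] * (1 - math.exp(-gamma * t))      # update pulse rate
            if cand_fit > best_fit:
                best, best_fit = cand[:], cand_fit
    return best, best_fit

# Toy training sample (invented): (record A, record B, is_duplicate).
labeled = [
    (("John Smith", "New York"), ("Jon Smith", "New York"), True),
    (("Maria Garcia", "Boston"), ("Maria Garcia", "Bostn"), True),
    (("John Smith", "New York"), ("Jane Smith", "Newark"), False),
    (("Alan Turing", "London"), ("Grace Hopper", "London"), False),
]
training_pairs = [(evidence(a, b), label) for a, b, label in labeled]
weights, score = bat_search(training_pairs, dim=2)
print("weights:", weights, "training F1:", score)
```

The search uses only frequency-tuned moves and a local walk around the best bat, which is the contrast with GP drawn above. In practice the fitness would be measured on a small representative training sample rather than on four hand-made pairs.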