A new design of ensemble classifiers for high-dimension entity resolution

For high-dimensional Entity Resolution (ER), a new design of ensemble classifiers based on feature selection is proposed, which treats ER as a binary classification problem. The classification performance and similarity measures of a binary classifier are defined, and the Support Vector Machine (SVM) is adopted as the base classifier. Classification accuracy, the output dissimilarity of the classifiers, and the feature subset number are used as optimization objectives. The feature similarity vector of two candidate records is calculated as the input data. Based on the characteristics of ER, the multiobjective problem is translated into a single-objective optimization problem, and graph-based ant colony optimization is used to solve it. The proposed method is validated on high-dimensional datasets.
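As an illustration of two steps named in the abstract, the sketch below builds a feature similarity vector for a pair of candidate records and collapses the three objectives into one score. It is not the authors' formulation: the per-field string similarity (Python's `difflib` ratio), the disagreement rate used as output dissimilarity, the weighted-sum scalarization, and all function names, parameters, and weights are assumptions chosen for the example. The abstract does not state how the objectives are combined, nor does it show the SVM training or the ant colony search, so those are left out.

```python
from difflib import SequenceMatcher


def similarity_vector(rec_a, rec_b, fields):
    """Return one similarity score in [0, 1] per field for a candidate record pair.

    Assumption: each field is compared as a lowercased string with difflib;
    the source does not say which similarity functions are used.
    """
    return [
        SequenceMatcher(None, str(rec_a[f]).lower(), str(rec_b[f]).lower()).ratio()
        for f in fields
    ]


def disagreement(preds_a, preds_b):
    """Fraction of record pairs on which two classifiers output different labels.

    Assumption: a simple stand-in for the 'output dissimilarity' objective.
    """
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and of equal length")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def scalarized_fitness(accuracy, dissimilarity, subset_number, max_subset_number,
                       w_acc=0.6, w_div=0.3, w_num=0.1):
    """Combine the three objectives into a single score (higher is better).

    Assumption: a weighted sum with hypothetical weights; a smaller
    subset_number is rewarded. The source only says that the multiobjective
    problem is translated into a single-objective one.
    """
    number_term = 1.0 - subset_number / max_subset_number
    return w_acc * accuracy + w_div * dissimilarity + w_num * number_term
```

In such a setup, `similarity_vector` would produce the classifier input for each candidate pair, and a search procedure would compare candidate solutions by their `scalarized_fitness` values.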
