A comparison study of similarity measures for covering-based neighborhood classifiers

Abstract: In data mining, neighborhood classifiers are valid not only for numeric data but also for symbolic data. The key issue for a neighborhood classifier is how to measure the similarity between two instances. In this paper, we compare six similarity measures for symbolic data, namely Overlap, Eskin, occurrence frequency (OF), inverse OF (IOF), Goodall3, and Goodall4, under the framework of a covering-based neighborhood classifier. In the training stage, a covering of the universe is built based on the given similarity measure. A covering reduction algorithm then removes some of the covering blocks and determines the representatives. In the testing stage, the similarity between each unlabeled instance and each representative is computed. The closest representative, or a few of the closest representatives, determines the predicted class label of the unlabeled instance. We compared the six similarity measures in experiments on 15 University of California-Irvine (UCI) datasets. The results demonstrate that although no measure dominated the others in all scenarios, some measures had consistently high performance. With an appropriate similarity measure, such as Overlap, IOF, or OF, the covering-based neighborhood classifier outperformed the ID3, C4.5, and Naive Bayes classifiers.
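The testing stage described in the abstract can be illustrated with a minimal Python sketch. It assumes the common definition of the Overlap measure (the fraction of attributes on which two symbolic instances agree) and takes the set of representatives as given. The covering construction and covering reduction steps are not shown, because the abstract does not specify them. The names `overlap_similarity`, `predict`, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple


def overlap_similarity(x: Sequence[str], y: Sequence[str]) -> float:
    """Fraction of attributes on which two symbolic instances agree.

    Assumption: this is the usual Overlap measure for categorical data;
    the paper's exact normalization may differ.
    """
    if len(x) != len(y):
        raise ValueError("instances must have the same number of attributes")
    if len(x) == 0:
        raise ValueError("instances must have at least one attribute")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def predict(
    instance: Sequence[str],
    representatives: List[Tuple[Sequence[str], str]],
) -> Optional[str]:
    """Return the label of the most similar representative.

    Ties are broken in favor of the representative listed first.
    Returns None if no representatives are supplied.
    """
    best_label: Optional[str] = None
    best_sim = -1.0
    for rep, label in representatives:
        sim = overlap_similarity(instance, rep)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label


# Hypothetical representatives, as if produced by a training stage.
representatives = [
    (("sunny", "hot", "high"), "no"),
    (("rainy", "mild", "normal"), "yes"),
]

# The query agrees with the first representative on 2 of 3 attributes
# and with the second on 1 of 3, so the first label is predicted.
print(predict(("sunny", "mild", "high"), representatives))  # prints "no"
```

Swapping in another measure, such as OF or IOF, would only require replacing `overlap_similarity` with a function that weights mismatches by attribute-value frequencies estimated from the training data.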
