Automatic Threshold Calculation for the Categorical Distance Measure ConDist

The measurement of distances between objects described by categorical attributes is a key challenge in data mining. The unsupervised distance measure ConDist approaches this challenge based on the idea that categorical values within an attribute are similar if they occur with similar value distributions on correlated context attributes. An impact function controls the influence of the correlated context attributes in ConDist's distance calculation process. ConDist requires a user-defined threshold to purge context attributes whose correlations are caused by noisy, non-representative or small data sets. In this work, we propose an automatic threshold calculation method for each pair of attributes based on their value distributions and the number of objects in the data set. Further, these thresholds are also considered when applying ConDist's impact function. Experiments show that this approach is competitive with respect to well-selected user-defined thresholds and superior to poorly selected user-defined thresholds.
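The core idea, that two values of an attribute are close when they co-occur with similar value distributions on a context attribute, can be illustrated with a minimal sketch. This is not ConDist's actual formula: the Euclidean distance between conditional distributions, the function names, and the toy data are all assumptions made for illustration, and the correlation threshold and impact function described above are omitted entirely.

```python
from collections import Counter, defaultdict
from math import sqrt


def conditional_distributions(rows, target, context):
    """Estimate P(context value | target value) from a list of dict rows."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[context]] += 1
    return {
        value: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for value, ctr in counts.items()
    }


def value_distance(dists, a, b):
    """Euclidean distance between the context distributions of values a and b.

    Assumption: Euclidean distance is used only as a simple stand-in here.
    """
    keys = set(dists[a]) | set(dists[b])
    return sqrt(sum((dists[a].get(k, 0.0) - dists[b].get(k, 0.0)) ** 2 for k in keys))


# Hypothetical toy data: "color" is the target attribute, "shape" the context.
rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "orange", "shape": "round"},
    {"color": "orange", "shape": "round"},
    {"color": "blue", "shape": "square"},
    {"color": "blue", "shape": "square"},
]

dists = conditional_distributions(rows, "color", "shape")
print(value_distance(dists, "red", "orange"))
print(value_distance(dists, "red", "blue"))
```

In this toy data, "red" and "orange" have identical distributions over "shape" and so are at distance zero, while "red" and "blue" share no context values and are far apart. With only a handful of rows, such distributions are unreliable, which is the situation the per-pair threshold in the abstract is meant to guard against.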
