A supervised clustering and classification algorithm for mining data with mixed variables

This paper presents a data mining algorithm based on supervised clustering to learn data patterns and use these patterns for data classification. This algorithm enables a scalable incremental learning of patterns from data with both numeric and nominal variables. Two different methods of combining numeric and nominal variables in calculating the distance between clusters are investigated. In one method, separate distance measures are calculated for numeric and nominal variables, respectively, and are then combined into an overall distance measure. In another method, nominal variables are converted into numeric variables, and then a distance measure is calculated using all variables. We analyze the computational complexity, and thus, the scalability, of the algorithm, and test its performance on a number of data sets from various application domains. The prediction accuracy and reliability of the algorithm are analyzed, tested, and compared with those of several other data mining algorithms.

[1]  Robert A. Lordo,et al.  Learning from Data: Concepts, Theory, and Methods , 2001, Technometrics.

[2]  Aidong Zhang,et al.  WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases , 1998, VLDB.

[3]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[4]  G. S. Stiles,et al.  Fast full search equivalent encoding algorithms for image compression using vector quantization , 1992, IEEE Trans. Image Process..

[5]  Wei-Yin Loh,et al.  A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms , 2000, Machine Learning.

[6]  Ron Kohavi,et al.  Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid , 1996, KDD.

[7]  Nong Ye,et al.  A scalable, incremental learning algorithm for classification problems , 2002 .

[8]  Hans-Peter Kriegel,et al.  Incremental Clustering for Mining in a Data Warehousing Environment , 1998, VLDB.

[9]  Qiang Chen,et al.  Probabilistic techniques for intrusion detection based on computer audit data , 2001, IEEE Trans. Syst. Man Cybern. Part A.

[10]  Joshua Zhexue Huang,et al.  A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining , 1997, DMKD.

[11]  Zhexue Huang,et al.  CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES , 1997 .

[12]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[13]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[14]  H. Ralambondrainy,et al.  A conceptual version of the K-means algorithm , 1995, Pattern Recognit. Lett..

[15]  Nong Ye,et al.  Grid- and dummy-cluster-based learning of normal and intrusive clusters for computer intrusion detection , 2002 .

[16]  Tian Zhang Data clustering for very large datasets plus applications , 1997 .