A Design of Experiments Comparative Study on Clustering Methods

Cluster analysis is a multivariate data mining technique that is widely used in several areas. It aims to automatically group the $n$ elements of a database into $k$ clusters, using only the values of the variables observed for each case. However, the accuracy of the final clusters depends on the clustering method used. In this paper, we evaluate the performance of the main methods for cluster analysis, namely Ward's method, K-means, and Self-Organizing Maps. Unlike many studies published in the area, we generated the datasets using the Design of Experiments (DOE) technique, so that the conclusions about the methods are reliable and generalize across the different possible data structures. We considered the number of variables, the number of clusters, dataset size, sample size, cluster overlapping, and the presence of outliers as the DOE factors. The datasets were analyzed by each clustering method, and the resulting partitions were compared using Attribute Agreement Analysis, which provided information about the effects of the considered factors, both individually and through their interactions. The results showed that the number of clusters, overlapping, and the interaction between sample size and number of variables significantly affect all the studied methods. Moreover, at a significance level of 5%, the methods perform similarly, and it is not possible to affirm that one outperforms the others.
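As an illustration of how the six DOE factors listed above could define a set of experimental runs, the following minimal sketch enumerates a two-level full factorial design. The choice of a full factorial layout, the two levels per factor, and every level value shown are assumptions made for illustration only; the abstract names the factors but does not state their levels or the design type. The names `FACTORS` and `full_factorial` are hypothetical.

```python
from itertools import product

# Hypothetical two-level settings: the factor names follow the abstract,
# but the level values are illustrative assumptions, not the study's settings.
FACTORS = {
    "n_variables": (2, 10),
    "n_clusters": (2, 5),
    "dataset_size": (100, 1000),
    "sample_size": (20, 200),
    "overlap": ("low", "high"),
    "outliers": (False, True),
}


def full_factorial(factors):
    """Return one dict per run, covering every combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


design = full_factorial(FACTORS)
print(len(design))  # number of runs: 2**6 = 64
print(design[0])    # first run: every factor at its first listed level
```

Each dictionary in `design` describes one dataset configuration; under this sketch, a dataset would be generated per run and then clustered by each method under comparison.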
