Analysis and Application of Normalization Methods with Supervised Feature Weighting to Improve K-means Accuracy

Normalization methods are widely employed to transform the variables, or features, of a given dataset. In this paper, three classical feature normalization methods, Standardization (St), Min-Max (MM), and Median Absolute Deviation (MAD), are studied on several datasets from the UCI repository. An exhaustive analysis of the transformed features' ranges and of their influence on the Euclidean distance is performed, leading to the conclusion that knowledge about the group structure captured by each feature is needed to select the best normalization method for a given dataset. To effectively capture each feature's importance and adjust its contribution accordingly, this paper proposes a two-stage methodology for normalization and supervised feature weighting, based on the Pearson correlation coefficient and on a Random Forest feature-importance estimation. Simulations on five datasets reveal that, in terms of accuracy, the proposed two-stage methodology outperforms, or at least matches, the K-means performance obtained when only normalization is applied.
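
To make the two-stage pipeline concrete, the sketch below implements it in Python with NumPy, SciPy, and scikit-learn. It is a minimal illustration under stated assumptions, not the paper's code: the three normalization formulas follow their standard definitions; the Pearson weights are taken here as the absolute feature-label correlations rescaled to sum to one (the paper's exact rescaling may differ); the Random Forest weights are scikit-learn's impurity-based feature importances; and Iris stands in as a placeholder UCI dataset.

```python
# Minimal sketch of "normalize, then supervised-weight, then K-means".
# Assumptions: the weight-rescaling scheme and the Iris demo dataset are
# illustrative choices, not taken from the paper.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


# --- Stage 1: feature normalization (standard definitions) --------------
def standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def mad_normalize(X):
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)  # median absolute deviation
    return (X - med) / mad


# --- Stage 2: supervised feature weights --------------------------------
def pearson_weights(X, y):
    # |Pearson correlation| between each feature and the numeric class
    # label, rescaled to sum to one (assumed rescaling).
    r = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return r / r.sum()

def rf_weights(X, y, seed=0):
    # Impurity-based feature importances from a Random Forest classifier.
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    return rf.feature_importances_


# --- Evaluation: cluster-to-class accuracy via optimal matching ---------
def clustering_accuracy(y_true, y_pred):
    k = max(y_true.max(), y_pred.max()) + 1
    cont = np.zeros((k, k), dtype=np.int64)        # contingency matrix
    for t, p in zip(y_true, y_pred):
        cont[p, t] += 1
    rows, cols = linear_sum_assignment(-cont)       # Hungarian matching
    return cont[rows, cols].sum() / y_true.size


if __name__ == "__main__":
    X, y = load_iris(return_X_y=True)
    for norm_name, norm in [("St", standardize), ("MM", min_max), ("MAD", mad_normalize)]:
        Xn = norm(X)
        for w_name, weigher in [("none", None), ("Pearson", pearson_weights), ("RF", rf_weights)]:
            # Weighting rescales each normalized feature, so the weights act
            # directly on each feature's contribution to the Euclidean distance.
            Xw = Xn if weigher is None else Xn * weigher(Xn, y)
            labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xw)
            acc = clustering_accuracy(y, labels)
            print(f"{norm_name:>3} + {w_name:<7}: accuracy = {acc:.3f}")
```

Each normalization/weighting combination prints a cluster-to-class accuracy obtained by optimally matching K-means clusters to the true classes, mirroring the kind of accuracy comparison the abstract describes between "normalization only" and the two-stage approach.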
