Medoid-Shift for Noise Removal to Improve Clustering

We propose using medoid-shift to reduce noise in data before clustering. For every point, the method finds its k nearest neighbors (k-NN) and replaces the point with the medoid of that neighborhood. The process can be iterated. After this cleaning step, any clustering algorithm suitable for the data can be applied.
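The shifting step can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes several assumptions that the abstract does not state: Euclidean distance, a neighborhood consisting of the point itself plus its k nearest neighbors, the medoid defined as the neighborhood member with the smallest total distance to the other members, and a brute-force distance matrix that is practical only for small data sets. The function name `medoid_shift` and its parameters are hypothetical.

```python
import numpy as np


def medoid_shift(points, k=5, iterations=1):
    """Replace every point by the medoid of its k-NN neighborhood.

    Assumptions (not taken from the source): Euclidean distance, and a
    neighborhood made of the point itself plus its k nearest neighbors.
    """
    X = np.asarray(points, dtype=float)
    for _ in range(iterations):
        # Brute-force pairwise Euclidean distances, shape (n, n).
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        # Indices of the k + 1 closest points per row (self has distance 0).
        neighborhoods = np.argsort(dist, axis=1)[:, : k + 1]
        shifted = np.empty_like(X)
        for i, idx in enumerate(neighborhoods):
            # Distances restricted to the members of this neighborhood.
            sub = dist[np.ix_(idx, idx)]
            # Medoid: the member with the smallest total distance to the rest.
            medoid = idx[sub.sum(axis=1).argmin()]
            shifted[i] = X[medoid]
        X = shifted
    return X


# Usage: four points in a tight group plus one distant noise point.
data = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5]], dtype=float)
cleaned = medoid_shift(data, k=3)
```

Because the medoid is always an existing data point, every row of `cleaned` is one of the rows of `data`. In this toy input the distant point is replaced by a member of the tight group. The cleaned array can then be passed to whichever clustering algorithm suits the data.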
