A Distributed Clustering Approach for Heterogeneous Environments Using Fuzzy Rough Set Theory

The vast majority of data mining algorithms have been designed to work on centralized data. However, most of today's data sets are distributed both geographically and conceptually. Because of privacy concerns and computation cost, centralizing distributed data sets before analyzing them is impractical. In this paper, we present a framework for clustering distributed data that takes privacy and computation cost into account. To do so, we remove uncertain instances and send only the labels of the remaining instances to the central location. To remove the uncertain instances, we develop a new instance weighting method based on fuzzy and rough set theory. Results on well-known data sets verify the effectiveness of the proposed method compared to previous works.
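The abstract does not specify the weighting formula, so the following is only a minimal sketch of the general idea, not the paper's actual method. It assumes each local site already has cluster labels, and it weights every instance by its membership in the fuzzy-rough lower approximation of its own cluster (Kleene-Dienes implicator with crisp labels). The function names, the linear similarity relation, and the `scale` and `threshold` parameters are all hypothetical choices made for illustration.

```python
import math

def fuzzy_similarity(x, y, scale):
    """Assumed fuzzy relation: 1 for identical points, 0 at distance >= scale."""
    return max(0.0, 1.0 - math.dist(x, y) / scale)

def lower_approx_weights(points, labels, scale):
    """Weight of x = min over y in other clusters of (1 - R(x, y)).

    Instances close to another cluster get a low weight (uncertain);
    instances far from every other cluster get a weight near 1.
    """
    weights = []
    for x, cx in zip(points, labels):
        others = [1.0 - fuzzy_similarity(x, y, scale)
                  for y, cy in zip(points, labels) if cy != cx]
        # With a single cluster there is nothing to be confused with.
        weights.append(min(others) if others else 1.0)
    return weights

def labels_to_send(points, labels, scale, threshold):
    """Keep only confident instances; return {instance index: label}.

    Only indices and labels leave the site, never the feature values.
    """
    weights = lower_approx_weights(points, labels, scale)
    return {i: lab
            for i, (lab, w) in enumerate(zip(labels, weights))
            if w >= threshold}

# Hypothetical local site: two clusters on a line, with one instance
# of cluster 0 lying between them.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0), (0.55, 0.0)]
labels = [0, 0, 1, 1, 0]
sent = labels_to_send(points, labels, scale=1.0, threshold=0.5)
print(sent)
```

In this toy run the instance between the two clusters is discarded, and so is the cluster-1 instance closest to it, since both sit near a differently labeled neighbor. How the central location combines the received labels is outside the scope of this sketch.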
