Fuzzy c-Means Algorithms for Very Large Data

Very large (VL) data or big data are any data that you cannot load into your computer's working memory. This is not an objective definition, but a definition that is easy to understand and one that is practical, because there is a dataset too big for any computer you might use; hence, this is VL data for you. Clustering is one of the primary tasks used in the pattern recognition and data mining communities to search VL databases (including VL images) in various applications, and so, clustering algorithms that scale well to VL data are important and useful. This paper compares the efficacy of three different implementations of techniques aimed to extend fuzzy c-means (FCM) clustering to VL data. Specifically, we compare methods that are based on 1) sampling followed by noniterative extension; 2) incremental techniques that make one sequential pass through subsets of the data; and 3) kernelized versions of FCM that provide approximations based on sampling, including three proposed algorithms. We use both loadable and VL datasets to conduct the numerical experiments that facilitate comparisons based on time and space complexity, speed, quality of approximations to batch FCM (for loadable data), and assessment of matches between partitions and ground truth. Empirical results show that random sampling plus extension FCM, bit-reduced FCM, and approximate kernel FCM are good choices to approximate FCM for VL data. We conclude by demonstrating the VL algorithms on a dataset with 5 billion objects and presenting a set of recommendations regarding the use of different VL FCM clustering schemes.

[1]  Sudipto Guha,et al.  Streaming-data algorithms for high-quality clustering , 2002, Proceedings 18th International Conference on Data Engineering.

[2]  Lawrence O. Hall,et al.  Fast clustering with application to fuzzy rule generation , 1995, Proceedings of 1995 IEEE International Conference on Fuzzy Systems..

[3]  Tim Oates,et al.  Efficient progressive sampling , 1999, KDD '99.

[4]  Lawrence O. Hall,et al.  A Scalable Framework For Segmenting Magnetic Resonance Images , 2009, J. Signal Process. Syst..

[5]  Doulaye Dembélé,et al.  Fuzzy C-means Method for Clustering Microarray Data , 2003, Bioinform..

[6]  John F. Kolen,et al.  Reducing the time complexity of the fuzzy c-means algorithm , 2002, IEEE Trans. Fuzzy Syst..

[7]  Sudipto Guha,et al.  Clustering Data Streams: Theory and Practice , 2003, IEEE Trans. Knowl. Data Eng..

[8]  Ratko Orlandic,et al.  Clustering high-dimensional data using an efficient and effective data space reduction , 2005, CIKM '05.

[9]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[10]  R.J. Hathaway,et al.  Kernelized Non-Euclidean Relational Fuzzy c-Means Algorithm , 2005, The 14th IEEE International Conference on Fuzzy Systems, 2005. FUZZ '05..

[11]  James C. Bezdek,et al.  Pattern Recognition with Fuzzy Objective Function Algorithms , 1981, Advanced Applications in Pattern Recognition.

[12]  Hichem Frigui,et al.  Simultaneous Clustering and Feature Discrimination with Applications , 2007 .

[13]  Carl J. Schmidt,et al.  GoFigure: Automated Gene OntologyTM annotation , 2003, Bioinform..

[14]  Vincent Kanade,et al.  Clustering Algorithms , 2021, Wireless RF Energy Transfer in the Massive IoT Era.

[15]  Jiawei Han,et al.  CLARANS: A Method for Clustering Objects for Spatial Data Mining , 2002, IEEE Trans. Knowl. Data Eng..

[16]  Mohamed-Ali Belabbas,et al.  Spectral methods in machine learning and new strategies for very large datasets , 2009, Proceedings of the National Academy of Sciences.

[17]  Rong Jin,et al.  Approximate kernel k-means: solution to large scale kernel clustering , 2011, KDD.

[18]  Edward A. Fox,et al.  Incremental Clustering for Very Large Document Databases: Initial MARIAN Experience , 1995, Inf. Sci..

[19]  Petros Drineas,et al.  On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning , 2005, J. Mach. Learn. Res..

[20]  Zhongdong Wu,et al.  Fuzzy C-means clustering algorithm based on kernel method , 2003, Proceedings Fifth International Conference on Computational Intelligence and Multimedia Applications. ICCIMA 2003.

[21]  Charles E. Heckler,et al.  Applied Multivariate Statistical Analysis , 2005, Technometrics.

[22]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[23]  George Karypis,et al.  CLUTO - A Clustering Toolkit , 2002 .

[24]  James M. Keller,et al.  Comparing Fuzzy, Probabilistic, and Possibilistic Partitions , 2010, IEEE Transactions on Fuzzy Systems.

[25]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[26]  Veit Schwämmle,et al.  BIOINFORMATICS ORIGINAL PAPER , 2022 .

[27]  Lawrence O. Hall,et al.  Fast accurate fuzzy clustering through data reduction , 2003, IEEE Trans. Fuzzy Syst..

[28]  James C. Bezdek,et al.  Complexity reduction for "large image" processing , 2002, IEEE Trans. Syst. Man Cybern. Part B.

[29]  James C. Bezdek,et al.  Relational duals of the c-means clustering algorithms , 1989, Pattern Recognit..

[30]  Lawrence O. Hall,et al.  Single Pass Fuzzy C Means , 2007, 2007 IEEE International Fuzzy Systems Conference.

[31]  Ameet Talwalkar,et al.  Sampling Techniques for the Nystrom Method , 2009, AISTATS.

[32]  James C. Bezdek,et al.  Extending fuzzy and probabilistic clustering to very large data sets , 2006, Comput. Stat. Data Anal..

[33]  Sariel Har-Peled,et al.  On coresets for k-means and k-median clustering , 2004, STOC '04.

[34]  Rong Jin,et al.  Speedup of fuzzy and possibilistic kernel c-means for large-scale clustering , 2011, 2011 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2011).

[35]  James C. Bezdek,et al.  Efficient Implementation of the Fuzzy c-Means Clustering Algorithms , 1986, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Ramakant Nevatia,et al.  Cluster Boosted Tree Classifier for Multi-View, Multi-Pose Object Detection , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[37]  Lawrence O. Hall,et al.  Convergence of the Single-Pass and Online Fuzzy C-Means Algorithms , 2011, IEEE Transactions on Fuzzy Systems.

[38]  Matthias E. Futschik,et al.  Noise-robust Soft Clustering of Gene Expression Time-course Data , 2005, J. Bioinform. Comput. Biol..

[39]  James M. Keller,et al.  A possibilistic approach to clustering , 1993, IEEE Trans. Fuzzy Syst..

[40]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[41]  Rong Zhang,et al.  A large scale clustering scheme for kernel K-Means , 2002, Object recognition supported by user interaction for service robots.

[42]  Fazli Can,et al.  Incremental clustering for dynamic information processing , 1993, TOIS.

[43]  Philip S. Yu,et al.  A Framework for Clustering Evolving Data Streams , 2003, VLDB.

[44]  Liang Liao,et al.  A fast spatial constrained fuzzy kernel clustering algorithm for MRI brain image segmentation , 2007, 2007 International Conference on Wavelet Analysis and Pattern Recognition.

[45]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[46]  Thomas Seidl,et al.  Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations , 2011, EDBT/ICDT '11.