New unsupervised clustering algorithm for large datasets

A fast and accurate unsupervised algorithm has been developed for clustering very large datasets. Though designed for very large volumes of geospatial data, the algorithm is general enough to be used in a wide variety of application domains. The number of computations the algorithm requires is approximately O(N), which makes it faster than hierarchical algorithms. Unlike the popular K-means heuristic, this algorithm does not require a series of iterations to converge to a solution. In addition, the method does not depend on initializing a given number of cluster representatives, and so is insensitive to initial conditions. Because it is unsupervised, the algorithm can also "rank" each cluster by density. The method relies on weighting a dataset to grid points on a mesh and using a small number of rule-based agents to find the high-density clusters. This effectively reduces a large dataset to the size of the grid, which is usually many orders of magnitude smaller. Numerical experiments demonstrate the advantages of this algorithm over other techniques.
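The grid-weighting idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the weighting scheme or the agent rules, so the sketch assumes nearest-grid-point weighting (a simple per-cell count) and a hypothetical hill-climbing rule in which each agent steps to its densest neighboring cell. The names `weight_to_grid`, `climb`, and `find_peaks` are invented for illustration.

```python
import numpy as np

def weight_to_grid(points, bins, bounds):
    """Assumed weighting: count the points in each cell (one O(N) pass)."""
    (xmin, xmax), (ymin, ymax) = bounds
    density, _, _ = np.histogram2d(
        points[:, 0], points[:, 1],
        bins=bins, range=[[xmin, xmax], [ymin, ymax]],
    )
    return density

def climb(density, start):
    """Hypothetical agent rule: move to the densest of the 8 neighbors
    until no neighbor is strictly denser than the current cell."""
    i, j = start
    n_i, n_j = density.shape
    while True:
        best = (i, j)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                a, b = i + di, j + dj
                if 0 <= a < n_i and 0 <= b < n_j and density[a, b] > density[best]:
                    best = (a, b)
        if best == (i, j):
            return best
        i, j = best

def find_peaks(density, starts):
    """Run one agent per start cell and return the distinct non-empty
    peak cells, ranked from densest to least dense."""
    peaks = {climb(density, s) for s in starts}
    peaks = {p for p in peaks if density[p] > 0}
    return sorted(peaks, key=lambda p: density[p], reverse=True)
```

After the single weighting pass, the agents work only on the grid, so their cost depends on the number of cells rather than on N. Sorting the peaks by cell density gives one simple reading of the density "ranking" the abstract mentions.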
