Clustering large datasets

This review surveys current techniques for clustering large quantitative data sets. Storing summaries of the original data in a tree is found to improve the scalability of traditional methods to large problems. Density-based and grid-based techniques are introduced as offering similar scalability, while also allowing the user to discover clusters of arbitrary shape and to obtain an estimate of the number of clusters present in the data.
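
To make the idea of a data summary concrete, the sketch below shows one common form: a triple holding the point count, the per-dimension linear sum, and the total sum of squares, in the style of the clustering features used by BIRCH. The text above does not say which summary is meant, so the choice of this triple, the class name `Summary`, and its methods are illustrative assumptions. Because the triple is additive, two summaries can be merged without revisiting the original points, which is what lets a tree node represent all of the data beneath it.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Summary:
    """Hypothetical additive summary of a set of points (assumed form)."""
    n: int                    # number of points summarised
    linear_sum: List[float]   # per-dimension sum of coordinates
    square_sum: float         # sum of squared norms of the points

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "Summary":
        # A single point is the smallest possible summary.
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other: "Summary") -> "Summary":
        # Additivity: the summary of a union is the component-wise sum.
        return Summary(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum,
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        # Root-mean-square distance of the summarised points from the centroid.
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # guard against rounding error
```

A parent node in such a tree would hold the merge of its children's summaries, so the centroid and radius of any subtree are available without storing or rescanning the raw data.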
