An iterative algorithm for the solution of very large-scale diameter clustering problems

We introduce an iterative algorithm for solving the diameter minimization clustering problem (DMCP). The algorithm rests on two observations: 1) the optimal value of the problem restricted to any subset of the entities is a lower bound on the optimal value of the original problem; and 2) there exists a subset whose optimal clustering has the same value as that of the original problem. We also describe how to adapt the algorithmic framework to other clustering problems, namely the minimum sum-of-diameters clustering problem (MSDCP), the split maximization clustering problem (SMCP) and the maximum sum-of-splits clustering problem (MSSCP). A parallel implementation of our algorithm solves problems containing almost 600,000 entities while consuming only moderate amounts of time and memory. These problems are two orders of magnitude larger than the largest problems solved by the current state-of-the-art algorithm.
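To make the two observations concrete, the following is a minimal Python sketch of one way a sample-based iteration could exploit them. It is an illustration under stated assumptions, not the implementation described in the paper: the function names (`solve_exact`, `iterative_dmcp`), the brute-force subproblem solver, the Euclidean distance, the initial sample of `k + 1` entities, and the greedy extension step are all hypothetical choices made for this example. The sketch solves the DMCP exactly on a small sample, which yields a lower bound, and then tries to place every remaining entity in a cluster without exceeding that bound. If this succeeds, the clustering is optimal for the full problem; otherwise the entity that could not be placed joins the sample.

```python
from itertools import product
from math import dist


def diameter(points, labels):
    """Largest distance between two points that share a cluster label."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] == labels[j]:
                worst = max(worst, dist(points[i], points[j]))
    return worst


def solve_exact(points, k):
    """Brute-force DMCP solver (assumption: only usable on tiny inputs)."""
    best_value, best_labels = float("inf"), None
    for labels in product(range(k), repeat=len(points)):
        value = diameter(points, labels)
        if value < best_value:
            best_value, best_labels = value, labels
    return best_value, list(best_labels)


def iterative_dmcp(points, k):
    """Grow a sample until its optimal clustering extends to all points."""
    sample = list(range(min(k + 1, len(points))))  # hypothetical initial sample
    while True:
        # Observation 1: the sample's optimal value is a lower bound.
        lower_bound, sample_labels = solve_exact([points[i] for i in sample], k)
        clusters = [[] for _ in range(k)]
        for index, label in zip(sample, sample_labels):
            clusters[label].append(index)

        in_sample = set(sample)
        failed = None
        for p in range(len(points)):
            if p in in_sample:
                continue
            # Greedy extension: p may join a cluster only if it stays within
            # lower_bound of every current member of that cluster.
            for members in clusters:
                if all(dist(points[p], points[q]) <= lower_bound for q in members):
                    members.append(p)
                    break
            else:
                failed = p
                break

        if failed is None:
            # A feasible clustering whose diameter matches the lower bound
            # is optimal for the full problem (observation 2).
            return lower_bound, clusters
        sample.append(failed)  # enlarge the sample and re-solve
```

The loop terminates because the sample grows by one entity per iteration and, in the worst case, becomes the full set. In this sketch the exact solver is exponential in the sample size, so the approach is only practical when the final sample stays small relative to the dataset.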
