FISHDBC: Flexible, Incremental, Scalable, Hierarchical Density-Based Clustering for Arbitrary Data and Distance

FISHDBC is a flexible, incremental, scalable, and hierarchical density-based clustering algorithm. It is flexible because it lets users work on arbitrary data: rather than going through the feature extraction step that usually transforms raw data into numeric arrays, users define an arbitrary distance function instead. It is incremental and scalable: it avoids the $\mathcal O(n^2)$ complexity of other approaches in non-metric spaces and requires only lightweight computation to update the clustering when a few items are added. It is hierarchical: it produces a "flat" clustering which can be expanded to a tree structure, so that users can group clusters into super-clusters or divide them into sub-clusters when data exploration requires it. It is density-based and approximates HDBSCAN*, an evolution of DBSCAN.
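To make the idea of clustering arbitrary data through a user-defined distance concrete, the following is a minimal sketch and not the FISHDBC implementation. It assumes a hypothetical `jaccard_distance` on raw strings (compared through their character 3-grams) and computes, by brute force, the core distances and mutual reachability distance on which HDBSCAN* is built. The function names, the choice of `k`, and the convention of excluding an item from its own neighbors are illustrative assumptions. The brute-force loop costs $\mathcal O(n^2)$ distance calls, which is exactly the cost the abstract says FISHDBC avoids.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def jaccard_distance(a: str, b: str) -> float:
    """Hypothetical user-defined distance: Jaccard distance between the
    sets of character 3-grams of two raw strings (no feature vectors)."""
    def grams(s: str) -> set:
        # Strings shorter than 3 characters yield a single short gram.
        return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}

    ga, gb = grams(a), grams(b)
    return 1.0 - len(ga & gb) / len(ga | gb)


def core_distances(items: Sequence[T],
                   dist: Callable[[T, T], float],
                   k: int) -> List[float]:
    """Distance from each item to its k-th nearest neighbor.

    Brute force, O(n^2) distance calls; the item itself is excluded
    (an assumption: conventions differ on whether it counts)."""
    cores = []
    for i, x in enumerate(items):
        ds = sorted(dist(x, y) for j, y in enumerate(items) if j != i)
        cores.append(ds[k - 1])
    return cores


def mutual_reachability(i: int, j: int,
                        items: Sequence[T],
                        dist: Callable[[T, T], float],
                        cores: Sequence[float]) -> float:
    """Mutual reachability distance: max of both core distances and d(i, j)."""
    return max(cores[i], cores[j], dist(items[i], items[j]))


# Illustrative data: raw strings, clustered without any numeric encoding.
items = ["clustering", "clusterings", "cluster", "banana", "bananas"]
cores = core_distances(items, jaccard_distance, k=2)
d01 = mutual_reachability(0, 1, items, jaccard_distance, cores)
```

Any other callable taking two items and returning a non-negative number could replace `jaccard_distance` here; nothing in the sketch depends on the data being numeric or the distance being a metric.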
