Exact clustering in linear time

The time complexity of data clustering has long been viewed as fundamentally quadratic: run time grows with the square of the number of data items, because each item is compared for similarity to the items that precede it. Clustering large data sets has therefore been infeasible without resorting to probabilistic methods or capping the number of clusters. Here we introduce MIMOSA, a novel class of algorithms that achieve linear time complexity on clustering tasks. MIMOSA algorithms mark and match partial-signature keys in a hash table to obtain exact, error-free cluster retrieval. In benchmark measurements on clustering a data set of 10,000,000 news articles by topic, a MIMOSA implementation finished more than four orders of magnitude faster than a standard centroid implementation.
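The abstract does not spell out how keys are marked and matched, so the following is only an illustrative sketch of the general idea of hash-table clustering, not the MIMOSA algorithm itself. It assumes each item is a small set of feature tokens, that a "partial signature" is any k-subset of those tokens, and that two items belong to the same cluster when they share at least one such key. The names `partial_keys` and `cluster_by_partial_keys` and the parameter `k` are hypothetical.

```python
from itertools import combinations


def partial_keys(features, k=2):
    """Hypothetical partial signatures: every k-subset of the sorted features."""
    return combinations(sorted(set(features)), k)


def cluster_by_partial_keys(items, k=2):
    """Group items that share at least one partial-signature key.

    Each item is visited once and each key costs an expected O(1) dict
    lookup, so the work is near-linear in the number of items when the
    number of features per item is bounded. Illustrative assumption only.
    """
    parent = list(range(len(items)))  # union-find forest over item indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # key -> index of the first item that marked it
    for idx, features in enumerate(items):
        for key in partial_keys(features, k):
            if key in first_seen:
                # Match: merge this item with the item that marked the key.
                ra, rb = find(idx), find(first_seen[key])
                parent[max(ra, rb)] = min(ra, rb)
            else:
                # Mark: record the key for later items to match against.
                first_seen[key] = idx

    clusters = {}
    for idx in range(len(items)):
        clusters.setdefault(find(idx), []).append(idx)
    return list(clusters.values())
```

No item is ever compared directly with another item; all matching goes through the hash table, which is what avoids the pairwise, quadratic comparison step described above. The choice of k-subsets as keys is a stand-in for whatever signature scheme a real implementation would use.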
