High-performance geometric algorithms for sparse computation in big data analytics

Several leading supervised and unsupervised machine learning algorithms require as input the similarities between objects in a data set. Since the number of pairwise similarities grows quadratically with the size of the data set, computing all pairwise similarities is computationally prohibitive for large-scale data sets. The recently introduced methodology of “sparse computation” resolves this issue by computing only the relevant similarities instead of all pairwise similarities. To identify the relevant similarities, sparse computation efficiently projects the data onto a low-dimensional space in which a similarity is considered relevant if the corresponding objects are close to each other; the relevant similarities are then computed in the original space. Sparse computation identifies close pairs by partitioning the low-dimensional space into grid blocks and considering two objects close if they fall in the same or in adjacent grid blocks. This guarantees that all pairs of objects within a specified L∞ distance are identified, along with some pairs that are within twice this distance. For very large data sets, however, sparse computation can have a high runtime due to the enumeration of pairs of adjacent blocks. We propose new geometric algorithms that eliminate the need to enumerate adjacent blocks. Our empirical results on data sets with up to 10 million objects show that the new algorithms achieve a significant reduction in runtime. The algorithms have applications in large-scale computational geometry and (approximate) nearest-neighbor search. Python implementations of the proposed algorithms are publicly available.
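
To make the grid-block procedure concrete, the following is a minimal Python sketch of the baseline scheme described above: each object is assigned to a grid block in the low-dimensional projected space, and candidate close pairs are formed from objects in the same or in adjacent blocks. The function and parameter names (close_pairs_grid, block_size) are illustrative and do not come from the published implementation, and the sketch deliberately performs the explicit enumeration of adjacent blocks that the proposed algorithms are designed to avoid.

import numpy as np
from collections import defaultdict
from itertools import product

def close_pairs_grid(points, block_size):
    """Return candidate close pairs via grid blocks in the projected space.

    Every pair within L-infinity distance `block_size` is guaranteed to be
    returned; some returned pairs may be up to twice that distance apart.
    """
    points = np.asarray(points, dtype=float)
    n, d = points.shape

    # Assign each point to a grid block, identified by integer block coordinates.
    blocks = defaultdict(list)
    for i, coords in enumerate(np.floor(points / block_size).astype(int)):
        blocks[tuple(coords)].append(i)

    # Enumerate the 3^d offsets covering the block itself and all adjacent blocks.
    offsets = list(product((-1, 0, 1), repeat=d))

    pairs = set()
    for block, members in blocks.items():
        for off in offsets:
            neighbor = tuple(b + o for b, o in zip(block, off))
            for i in members:
                for j in blocks.get(neighbor, ()):
                    if i < j:
                        pairs.add((i, j))
    return pairs

# Illustrative usage: 1,000 objects projected onto a 3-dimensional space.
rng = np.random.default_rng(0)
pts = rng.random((1000, 3))
print(len(close_pairs_grid(pts, block_size=0.1)))

The enumeration of the 3^d neighboring offsets per block is exactly the step whose cost grows burdensome for very large data sets, which motivates the geometric algorithms proposed in the paper.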
