TOP: A Framework for Enabling Algorithmic Optimizations for Distance-Related Problems

Computing distances among data points is an essential part of many important algorithms in data analytics, graph analysis, and other domains. In each of these domains, developers have spent significant manual effort optimizing algorithms, often through novel applications of the triangle inequality, in order to minimize the number of distance computations the algorithms perform. In this work, we observe that many algorithms across these domains can be generalized as instances of a generic distance-related abstraction. Based on this abstraction, we derive seven principles for correctly applying the triangle inequality to optimize distance-related algorithms. Guided by these findings, we develop Triangular OPtimizer (TOP), the first software framework that is able to automatically produce optimized algorithms that either match or outperform manually designed algorithms for solving distance-related problems. TOP achieves speedups of up to 237x, and 2.5x on average.
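To make the kind of saving concrete, the following is a minimal sketch of one well-known triangle-inequality test, applied to nearest-center search. It is not TOP's interface or its generated code; the function names `pairwise_center_distances` and `nearest_center_pruned` are hypothetical and chosen only for illustration. The test relies on the fact that if d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c), so the distance from x to c' never needs to be computed.

```python
import math


def pairwise_center_distances(centers):
    """Precompute center-to-center distances once; they are reused for every point."""
    return [[math.dist(a, b) for b in centers] for a in centers]


def nearest_center_pruned(point, centers, center_dists):
    """Return (index of nearest center, number of point-to-center distances computed).

    Illustrative only: a single triangle-inequality filter, not a full algorithm.
    """
    best = 0
    best_d = math.dist(point, centers[0])
    computed = 1
    for j in range(1, len(centers)):
        # Triangle inequality: d(c_best, c_j) <= d(x, c_best) + d(x, c_j).
        # If d(c_best, c_j) >= 2 * d(x, c_best), then d(x, c_j) >= d(x, c_best),
        # so c_j cannot be strictly closer and its distance is skipped.
        if center_dists[best][j] >= 2.0 * best_d:
            continue
        d = math.dist(point, centers[j])
        computed += 1
        if d < best_d:
            best, best_d = j, d
    return best, computed


centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
center_dists = pairwise_center_distances(centers)

# A point close to the first center: the three remaining distances are pruned.
print(nearest_center_pruned((1.0, 1.0), centers, center_dists))
```

In this toy setting the point (1, 1) is assigned to the first center after a single distance computation instead of four. How much is saved in practice depends on the data and on which bounds are maintained.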
