MSPP: A Highly Efficient and Scalable Algorithm for Mining Similar Pairs of Points

The closest pair of points problem, or closest pair problem (CPP), is an important problem in computational geometry: given a set of points in a metric space, find the pair with the smallest distance between them. The problem arises in many applications, including clustering, graph partitioning, image processing, pattern identification, and intrusion detection. In air-traffic control, for example, aircraft that come too close together must be monitored, since this may indicate a possible collision. Numerous algorithms have been presented for solving the CPP, but those employed in practice have quadratic worst-case running time. In this article we present an elegant approximation algorithm for the CPP called MSPP: Mining Similar Pairs of Points. It is faster than the best currently known algorithms while maintaining very good accuracy. The proposed algorithm also detects a set of closely similar pairs of points in Euclidean and Pearson metric spaces, and it can be adapted to numerous real-world applications, such as clustering, dimension reduction, and the construction and analysis of gene/transcript co-expression networks, among others.
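To make the problem statement concrete, the sketch below shows the exact brute-force baseline that the definition implies: compare every pair and keep the closest one, which takes quadratic time in the number of points. It is not the MSPP algorithm, whose details are not given here. The function names `euclidean` and `closest_pair_bruteforce` and the sample points are illustrative assumptions.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def closest_pair_bruteforce(points, dist=euclidean):
    """Return (distance, i, j) for the closest pair of points.

    Exact O(n^2) baseline, not MSPP: every pair is compared once.
    `dist` can be swapped for another distance function.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    best = None
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        if best is None or d < best[0]:
            best = (d, i, j)
    return best


# Hypothetical sample data for illustration only.
points = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (9.0, 2.0)]
d, i, j = closest_pair_bruteforce(points)
print(f"closest pair: indices ({i}, {j}), distance {d:.4f}")
```

Because the distance function is a parameter, the same baseline could be run with a correlation-based distance to check an approximate method's output on small inputs.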
