The single pixel GPS: learning big data signals from tiny coresets

We present algorithms for simplifying and clustering patterns from sensors such as GPS, LiDAR, and other devices that can produce high-dimensional signals. The algorithms are suitable for handling very large (e.g., terabytes) streaming data and can be run in parallel on networks or clouds. Applications include compression, denoising, activity recognition, road matching, and map generation. We encode these problems as (k,m)-segment mean problems. Formally, we provide (1 + ε)-approximations to the k-segment and (k,m)-segment mean of a d-dimensional discrete-time signal. The k-segment mean is a k-piecewise linear function that minimizes the regression distance to the signal. The (k,m)-segment mean has an additional constraint that the projection of the k segments on R^d consists of only m ≤ k segments. Existing algorithms for these problems take O(kn^2) and n^O(mk) time, respectively, and O(kn^2) space, where n is the length of the signal. Our main tool is a new coreset for discrete-time signals. The coreset is a smart compression of the input signal that allows computation of a (1 + ε)-approximation to the k-segment or (k,m)-segment mean in O(n log n) time for arbitrary constants ε, k, and m. We use coresets to obtain a parallel algorithm that scans the signal in one pass, using space and update time per point that is polynomial in log n. We provide empirical evaluations of the quality of our coreset and experimental results that show how our coreset boosts both inefficient optimal algorithms and existing heuristics. We demonstrate our results for extracting signals from GPS traces. However, the results are more general and applicable to other types of sensors.
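To make the k-segment mean concrete, the following is a minimal sketch of the kind of exact dynamic program that the O(kn^2) baseline refers to. It is not the coreset construction of the paper. It assumes that "regression distance" means the sum of squared errors of each signal coordinate regressed on time, and that the k segments are fitted independently with no continuity enforced at the breakpoints. The names `segment_costs` and `k_segment_mean` are illustrative only.

```python
import numpy as np


def segment_costs(t, y):
    """cost[i, j] = squared error of the best-fit line through points i..j (inclusive)."""
    n, d = y.shape
    # Prefix sums let each segment cost be evaluated in O(d) time.
    pt = np.concatenate([[0.0], np.cumsum(t)])
    ptt = np.concatenate([[0.0], np.cumsum(t * t)])
    py = np.vstack([np.zeros(d), np.cumsum(y, axis=0)])
    pty = np.vstack([np.zeros(d), np.cumsum(t[:, None] * y, axis=0)])
    pyy = np.concatenate([[0.0], np.cumsum((y * y).sum(axis=1))])

    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            m = j - i + 1
            st = pt[j + 1] - pt[i]
            sy = py[j + 1] - py[i]
            # Centered second moments of time and signal on the segment.
            ctt = (ptt[j + 1] - ptt[i]) - st * st / m
            cty = (pty[j + 1] - pty[i]) - st * sy / m
            cyy = (pyy[j + 1] - pyy[i]) - sy @ sy / m
            sse = cyy - (cty @ cty) / ctt if ctt > 1e-12 else cyy
            cost[i, j] = max(sse, 0.0)  # clamp tiny negative round-off
    return cost


def k_segment_mean(t, y, k):
    """Return (total cost, list of (first, last) index pairs) for the best k segments."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(t), -1)
    n = len(t)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    cost = segment_costs(t, y)

    # best[s, j]: minimal cost of covering the first j points with s segments.
    best = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for s in range(1, k + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):  # last segment covers points i..j-1
                c = best[s - 1, i] + cost[i, j - 1]
                if c < best[s, j]:
                    best[s, j] = c
                    back[s, j] = i

    # Walk the back-pointers to recover the segment boundaries.
    bounds, j = [], n
    for s in range(k, 0, -1):
        i = int(back[s, j])
        bounds.append((i, j - 1))
        j = i
    return float(best[k, n]), bounds[::-1]
```

For example, `k_segment_mean(range(6), [0, 1, 2, 10, 9, 8], 2)` splits the signal into the index ranges (0, 2) and (3, 5), each of which lies exactly on a line. The triple loop over s, j, and i is what gives the O(kn^2) running time, and the n-by-n cost table in this sketch costs O(n^2) memory, which is why such exact solvers become impractical on long signals and why running them on a small coreset is attractive.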
