Fast Manhattan sketches in data streams

The L1-distance, also known as the Manhattan or taxicab distance, between two vectors <i>x, y</i> in R<sup><i>n</i></sup> is ∑<sub><i>i</i>=1</sub><sup><i>n</i></sup> |<i>x<sub>i</sub></i> − <i>y<sub>i</sub></i>|. Approximating this distance is a fundamental primitive on massive databases, with applications to clustering, nearest neighbor search, network monitoring, regression, sampling, and support vector machines. We give the first 1-pass streaming algorithm for this problem in the turnstile model with <i>O</i>*(1/ε<sup>2</sup>) space and <i>O</i>*(1) update time. The <i>O</i>* notation hides polylogarithmic factors in ε, <i>n</i>, and the precision required to store vector entries. All previous algorithms either required Ω(1/ε<sup>3</sup>) space or Ω(1/ε<sup>2</sup>) update time and/or could not work in the turnstile model (i.e., support an arbitrary number of updates to each coordinate). Our bounds are optimal up to <i>O</i>*(1) factors.
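For orientation, the problem setting can be illustrated with a small Python sketch. This is not the algorithm of this paper: it is the classical Cauchy (1-stable) random projection with a median estimator, which touches all k counters on every update and therefore sits in the slow-update regime the abstract contrasts against. The class name `CauchyL1Sketch`, the parameter `k`, and the per-coordinate seeding scheme are assumptions made for illustration only.

```python
import math
import random
import statistics


class CauchyL1Sketch:
    """Illustrative linear sketch for L1 distance under turnstile updates.

    Keeps k counters; counter j holds sum_i x_i * C[j][i], where the C[j][i]
    are standard Cauchy variables. The variables for coordinate i are
    regenerated from a seed on demand, so the vector x is never stored.
    (Hypothetical helper, not the paper's construction.)
    """

    def __init__(self, k, seed=0):
        self.k = k
        self.seed = seed
        self.counters = [0.0] * k

    def _cauchy_column(self, i):
        # Deterministic per-coordinate generator: two sketches built with the
        # same seed see the same Cauchy values for coordinate i.
        rng = random.Random(f"{self.seed}:{i}")
        return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(self.k)]

    def update(self, i, delta):
        # Turnstile update x_i += delta; delta may be negative, and the same
        # coordinate may be updated any number of times. Costs O(k) time.
        for j, c in enumerate(self._cauchy_column(i)):
            self.counters[j] += delta * c

    def estimate_l1_diff(self, other):
        # Each counter difference is distributed as ||x - y||_1 times a
        # standard Cauchy variable, and the median of |Cauchy| is 1, so the
        # sample median of the absolute differences estimates ||x - y||_1.
        return statistics.median(
            abs(a - b) for a, b in zip(self.counters, other.counters)
        )


if __name__ == "__main__":
    sx = CauchyL1Sketch(k=2001, seed=42)
    sy = CauchyL1Sketch(k=2001, seed=42)
    for i, delta in [(3, 5), (7, 2), (3, -1)]:  # stream defining x
        sx.update(i, delta)
    for i, delta in [(3, 1), (9, 4)]:  # stream defining y
        sy.update(i, delta)
    # The exact L1 distance here is |4 - 1| + |2 - 0| + |0 - 4| = 9.
    print(sx.estimate_l1_diff(sy))
```

In this sketch the relative error of the median estimator shrinks roughly like 1/√k, so k on the order of 1/ε² counters are needed for a (1 ± ε) approximation. That matches the space bound quoted above up to the hidden factors, while the O(k) work per update is the cost the paper's result avoids.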
