Frequency estimation, also known as the Point Query problem, is one of the most fundamental problems in streaming algorithms. Given a stream S of elements from some universe U = {1, ..., n}, the goal is to compute, in a single pass, a short “sketch” of S so that for any element i ∈ U, one can estimate the number x_i of times i occurs in S based on the sketch alone. Two state-of-the-art solutions to this problem are the Count-Min and Count-Sketch algorithms. They are based on linear sketches, which means that data elements can be deleted as well as inserted, and sketches for two different streams can be combined via addition.

However, the guarantees offered by Count-Min and Count-Sketch are incomparable. The frequency estimator x̃ produced by Count-Min, using O(1/ε · log n) dimensions, guarantees that (i) ‖x̃ − x‖∞ ≤ ε‖x‖1 with high probability, and (ii) x̃ ≥ x holds deterministically. Count-Min also relies on the assumption that x ≥ 0. On the other hand, Count-Sketch, using O(1/ε² · log n) dimensions, guarantees that ‖x̃ − x‖∞ ≤ ε‖x‖2 with high probability. A natural question is whether it is possible to design a “best of both worlds” sketching method, with error guarantees depending on the ℓ2 norm and space comparable to Count-Sketch, that (like Count-Min) also has the no-underestimation property.

Our main set of results shows that the answer to the above question is negative. We show this in two incomparable computational models: linear sketching and streaming algorithms. Specifically, we show that:

• Any linear sketch satisfying the ℓp norm error guarantee with probability at least 2/3 and having the no-underestimation property must have dimension at least Ω(n/ε), even if the sketched vectors are non-negative. This bound is tight, as we also give a linear sketch of dimension O(n/ε) satisfying these properties.
• Any streaming algorithm satisfying the ℓp norm error guarantee with probability at least 2/3 and having the no-underestimation property must use at least Ω(n/ε) bits. This holds even for algorithms that allow only insertions and make any constant number of passes over the stream. This bound is tight up to a logarithmic factor.

We also study the complementary problem, where the sketch is required not to overestimate, i.e., x̃ ≤ x must always hold. We show that any linear sketch satisfying this property and having the ℓp error guarantee with probability at least 2/3 must have dimension at least Ω(n/ε). We also show that this bound is tight up to polylogarithmic factors, by providing an appropriate linear sketch.

MIT. Email: indyk@mit.edu
MIT. Email: shyamsn@mit.edu
CMU. Email: dwoodruf@cs.cmu.edu
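
To make the two baseline sketches concrete, the following is a minimal Python illustration of Count-Min and Count-Sketch as they are commonly described. It is a sketch under stated assumptions, not the construction analyzed in this work: the hash family `make_hash`, the parameter names `width`, `depth`, and `seed`, and the demo stream are all hypothetical choices made for illustration. It shows the one-sided estimate of Count-Min on non-negative data, the signed counters of Count-Sketch, and the linearity that permits deletions and merging by addition.

```python
import random
import statistics
from collections import Counter

P = (1 << 61) - 1  # Mersenne prime for the illustrative hash family (an assumption)


def make_hash(rng):
    """Return h(i) = (a*i + b) mod P with random a, b."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda i: (a * i + b) % P


class CountMin:
    """depth rows of width counters; the estimate is the minimum over rows."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.bucket = [make_hash(rng) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def update(self, i, delta=1):
        for row, h in zip(self.table, self.bucket):
            row[h(i) % self.width] += delta

    def estimate(self, i):
        # Collisions only add mass when x >= 0, so this never underestimates.
        return min(row[h(i) % self.width]
                   for row, h in zip(self.table, self.bucket))

    def merge(self, other):
        # Linearity: sketches built with the same hashes add entrywise.
        for mine, theirs in zip(self.table, other.table):
            for j, v in enumerate(theirs):
                mine[j] += v


class CountSketch:
    """Like Count-Min, but with random signs; the estimate is a median."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.bucket = [make_hash(rng) for _ in range(depth)]
        self.sign = [make_hash(rng) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    @staticmethod
    def _sign(s, i):
        return 1 if s(i) % 2 == 0 else -1

    def update(self, i, delta=1):
        for row, h, s in zip(self.table, self.bucket, self.sign):
            row[h(i) % self.width] += self._sign(s, i) * delta

    def estimate(self, i):
        # Signed collisions can cancel or add, so the error is two-sided.
        return statistics.median(
            self._sign(s, i) * row[h(i) % self.width]
            for row, h, s in zip(self.table, self.bucket, self.sign)
        )


# Hypothetical demo stream: uniform noise plus one heavy element (7).
rng = random.Random(1)
stream = [rng.randrange(1, 1001) for _ in range(5000)] + [7] * 500
truth = Counter(stream)

cm = CountMin(width=64, depth=5)
cs = CountSketch(width=64, depth=5)
for i in stream:
    cm.update(i)
    cs.update(i)
```

On this non-negative stream, `cm.estimate(i)` is at least `truth[i]` for every `i`, whereas `cs.estimate(i)` may fall on either side of the true count. An odd `depth` keeps the Count-Sketch median an integer.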