Sketching in adversarial environments

We formalize a realistic model for computations over massive data sets. The model, referred to as the adversarial sketch model, unifies the well-studied sketch and data stream models and adds a cryptographic flavor that considers the execution of protocols in "hostile environments". It provides a framework for studying the complexity of many tasks involving massive data sets. The adversarial sketch model consists of several participating parties: honest parties, whose goal is to compute a pre-determined function of their inputs, and an adversarial party. Computation in this model proceeds in two phases. In the first phase, the adversarial party chooses the inputs of the honest parties. These inputs are sets of elements taken from a large universe, and are provided to the honest parties in an on-line manner in the form of a sequence of insert and delete operations. Once an operation from the sequence has been processed, it is discarded and cannot be retrieved unless explicitly stored. During this phase, the honest parties are not allowed to communicate. Moreover, they do not share any secret information, and any public information they share is known to the adversary in advance. In the second phase, the honest parties engage in a protocol in order to compute a pre-determined function of their inputs. In this paper we settle the complexity (up to logarithmic factors) of two fundamental problems in this model: testing whether two massive data sets are equal, and approximating the size of their symmetric difference. We construct explicit and efficient protocols with sublinear sketches of essentially optimal size, poly-logarithmic update time during the first phase, and poly-logarithmic communication and computation during the second phase. Our main technical contribution, which may be of independent interest, is an explicit and deterministic encoding scheme that enjoys two seemingly conflicting properties: incrementality and high distance.
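
To make the two-phase structure concrete, the following is a minimal toy sketch and not the encoding scheme constructed in the paper. It assumes a deterministic power-sum sketch over the prime field of size 2^61 - 1; the names `PowerSumSketch`, `run_phase_one`, and `sketches_agree` are hypothetical. The sketch is incremental (each insert or delete touches only k counters) and uses no secret randomness, but it is guaranteed to separate two sets only when their symmetric difference is non-empty and has at most k elements. An adversary who picks sets that differ in more elements may be able to force a collision, which is the kind of gap a high-distance encoding is meant to close.

```python
P = 2**61 - 1  # a Mersenne prime; elements are assumed to lie in range(P)


class PowerSumSketch:
    """Toy deterministic sketch: the first k power sums of a set, modulo P."""

    def __init__(self, k):
        self.k = k
        self.sums = [0] * k  # sums[j] holds the sum of x**j over the current set

    def _update(self, x, sign):
        if not 0 <= x < P:
            raise ValueError("element outside the assumed universe")
        power = 1  # x**0
        for j in range(self.k):
            self.sums[j] = (self.sums[j] + sign * power) % P
            power = power * x % P

    def insert(self, x):
        self._update(x, +1)

    def delete(self, x):
        self._update(x, -1)


def run_phase_one(ops, k):
    """Phase one: process the operation sequence once, keeping only the sketch."""
    sketch = PowerSumSketch(k)
    for op, x in ops:
        if op == "insert":
            sketch.insert(x)
        elif op == "delete":
            sketch.delete(x)
        else:
            raise ValueError(f"unknown operation: {op}")
    return sketch


def sketches_agree(a, b):
    """Phase two (toy version): the parties exchange and compare sketches."""
    return a.k == b.k and a.sums == b.sums


# Two parties end up with the same set {5, 7} via different sequences.
alice = run_phase_one(
    [("insert", 5), ("insert", 9), ("insert", 7), ("delete", 9)], k=3
)
bob = run_phase_one([("insert", 7), ("insert", 5)], k=3)
# A third party holds {6, 7}, which differs from {5, 7} in two elements.
carol = run_phase_one([("insert", 7), ("insert", 6)], k=3)

print(sketches_agree(alice, bob))    # True
print(sketches_agree(alice, carol))  # False
```

The separation guarantee for small differences follows from the invertibility of Vandermonde matrices with distinct nodes over a field. The example is only meant to illustrate incrementality under insert and delete operations; it does not reproduce the sketch size, update time, or adversarial guarantees claimed above.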
