Efficient and private distance approximation in the communication and streaming models

This thesis studies distance approximation in two closely related models: the streaming model and the two-party communication model.

In the streaming model, a massive data stream is presented in an arbitrary order to a randomized algorithm that tries to approximate certain statistics of the data with only a few (usually one) passes over the data. For instance, the data may be a flow of packets on the internet or a set of records in a huge database. The size of the data necessitates the use of extremely efficient randomized approximation algorithms. Problems of interest include approximating the number of distinct elements, approximating the surprise index of a stream, or, more generally, approximating the norm of a dynamically changing vector whose coordinates are updated multiple times in an arbitrary order.

In the two-party communication model, two parties wish to efficiently compute a relation of their inputs. We consider the problem of approximating Lp distances for any p ≥ 0. It turns out that lower bounds on the communication complexity of these relations yield lower bounds on the memory required by streaming algorithms for the problems listed above. Moreover, upper bounds in the streaming model translate to constant-round protocols in the communication model, with communication proportional to the memory required by the streaming algorithm. The communication model also has its own applications, such as secure data mining, where in addition to low communication, the goal is to prevent either party from learning anything about the other's input beyond what follows from the output and their own private input.

We develop new algorithms and lower bounds that resolve key open questions in both of these models. The highlights of the results are as follows.
(1) We give an Ω(1/ε²) lower bound for approximating the number of distinct elements of a data stream in one pass to within a (1 ± ε) factor with constant probability, as well as the p-th frequency moment Fp for any p ≥ 0. This is tight up to very small factors, and greatly improves upon the earlier Ω(1/ε) lower bound for these problems. It also gives the same quadratic improvement for the communication complexity of 1-round protocols for approximating the Lp distance for any p ≥ 0.

(2) We give a 1-pass O(m^(1−2/p))-space streaming algorithm for (1 ± ε)-approximating the Lp norm of an m-dimensional vector presented as a data stream for any p ≥ 2. This algorithm improves the previous O(m^(1−1/(p−1))) bound, and is optimal up to polylogarithmic factors. As a special case, our algorithm can be used to approximate the frequency moments Fp of a data stream with the same optimal amount of space. This resolves the main open question of the 1996 paper by Alon, Matias, and Szegedy.

(3) In the two-party communication model, we give a protocol for privately approximating the Euclidean distance (L2) between two m-dimensional vectors, held by different parties, with only polylog m communication and O(1) rounds. This tremendously improves upon the earlier protocol of Feigenbaum, Ishai, Malkin, Nissim, Strauss, and Wright, which achieved O(√m) communication for privately approximating the Hamming distance only.

This thesis also contains several previously unpublished results concerning the first item above, including new lower bounds for the communication complexity of approximating the Lp distances when the vectors are uniformly distributed and the protocol is only correct for most inputs, as well as tight lower bounds for the multiround complexity of a restricted class of protocols that we call linear.

(Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
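
The link between the two models described above can be illustrated with a small sketch. The code below is not the thesis's algorithm or protocol: it is a textbook-style "tug-of-war" linear sketch for the squared L2 norm in the spirit of Alon, Matias, and Szegedy, and it offers no privacy. The class name `L2Sketch`, its parameters, and the use of a BLAKE2 hash in place of a 4-wise independent sign family are all assumptions made for illustration. Because the sketch is linear in the underlying vector, it handles arbitrary-order updates, and one party can send its counters to the other to estimate the L2 distance in a single round.

```python
import hashlib
from statistics import median


class L2Sketch:
    """Hypothetical AMS-style linear sketch of a vector under streaming updates."""

    def __init__(self, groups=9, per_group=64, seed=0):
        self.groups = groups
        self.per_group = per_group
        self.seed = seed
        # One counter per random sign vector; memory does not depend on m.
        self.counters = [0] * (groups * per_group)

    def _sign(self, row, index):
        # Pseudo-random +/-1 from a hash. This is an assumption standing in
        # for the 4-wise independent family used in the usual analysis.
        data = f"{self.seed}:{row}:{index}".encode()
        digest = hashlib.blake2b(data, digest_size=1).digest()
        return 1 if digest[0] & 1 else -1

    def update(self, index, delta):
        # Stream update "coordinate `index` changes by `delta`".
        for row in range(len(self.counters)):
            self.counters[row] += self._sign(row, index) * delta

    def subtract(self, other):
        # Linearity: sketch(x) - sketch(y) equals sketch(x - y)
        # when both sketches share the same seed and shape.
        diff = L2Sketch(self.groups, self.per_group, self.seed)
        diff.counters = [a - b for a, b in zip(self.counters, other.counters)]
        return diff

    def estimate_squared_norm(self):
        # Median of group means of squared counters.
        means = []
        for g in range(self.groups):
            block = self.counters[g * self.per_group:(g + 1) * self.per_group]
            means.append(sum(c * c for c in block) / self.per_group)
        return median(means)


if __name__ == "__main__":
    alice, bob = L2Sketch(seed=1), L2Sketch(seed=1)
    for i, d in [(1, 4), (2, -3), (10, 2)]:
        alice.update(i, d)
    for i, d in [(1, 1), (2, -3), (5, 6)]:
        bob.update(i, d)
    # Alice would send `alice.counters` to Bob; Bob estimates ||x - y||^2.
    print(alice.subtract(bob).estimate_squared_norm())
```

With the default parameters the estimate concentrates around the true squared distance (49 for the vectors in the usage example), but the accuracy guarantee here is only heuristic, since the hash-based signs are not a proven 4-wise independent family.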
