Nimble Algorithms for Cloud Computing

Cloud computing is a new paradigm in which data is stored across multiple servers and the goal is to compute a function of all the data. We consider a simple model in which each server uses polynomial time and space, but communication among servers, being more expensive, is ideally bounded by a polylogarithmic function of the input size. We call algorithms that satisfy these resource bounds nimble. The main contribution of the paper is to develop nimble algorithms for several areas that involve massive data and, for that reason, have been studied extensively in the context of streaming algorithms. The areas are approximation of frequency moments, counting bipartite homomorphisms (the number of copies of a fixed bipartite graph H in a graph G), rank-k approximation of a matrix, and clustering. For frequency moments, we use a new importance sampling technique based on high powers of the frequencies. We reduce the problem of counting homomorphisms to estimating implicitly defined frequency moments. For rank-k approximation, besides recent results of several authors developed in the streaming context, we use a new variant of the random projection method. For clustering, we use our rank-k approximation together with the small coreset of Chen [15], whose size is at most polynomial in the dimension. In contrast to our algorithms in the cloud computing model, known lower bounds in the streaming model for frequency moments and rank-k approximation rule out algorithms that use polylogarithmic space.

Microsoft Research India. Email: kannan@microsoft.com
Georgia Tech. Email: vempala@gatech.edu
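To make the frequency-moment setting concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that the global frequency of an item is the sum of its per-server frequencies, and the server contents and function names are hypothetical. It computes the k-th frequency moment F_k (the sum of the k-th powers of the item frequencies) by shipping every local table to one place, which is exactly the communication cost a nimble algorithm tries to avoid.

```python
from collections import Counter


def frequency_moment(counts, k):
    """Return F_k, the sum over items of frequency ** k."""
    return sum(f ** k for f in counts.values())


def merge_counts(local_counts):
    """Add per-server frequency tables into one global table.

    Assumption: an item's global frequency is the sum of its
    frequencies on the individual servers.
    """
    total = Counter()
    for table in local_counts:
        total.update(table)  # Counter.update adds counts
    return total


# Hypothetical data: three servers, each holding its own item counts.
servers = [
    Counter({"a": 3, "b": 1}),
    Counter({"a": 2, "c": 4}),
    Counter({"b": 5, "c": 1}),
]

# Naive baseline: communication is linear in the size of the tables.
global_counts = merge_counts(servers)
f1_global = frequency_moment(global_counts, 1)
f2_global = frequency_moment(global_counts, 2)

# Summing the local second moments does not give the global one.
f2_sum_of_locals = sum(frequency_moment(t, 2) for t in servers)
```

In this toy instance the first moment is additive across servers, but the second moment of the merged table differs from the sum of the local second moments. This is why moments with k > 1 cannot be obtained by having each server send a single locally computed number.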

[1] S. Muthukrishnan, et al. Data streams: algorithms and applications, 2005, SODA '03.

[2] Christos Boutsidis, et al. An improved approximation algorithm for the column subset selection problem, 2008, SODA.

[3] Rakesh Agrawal, et al. Privacy-preserving data mining, 2000, SIGMOD 2000.

[4] David P. Woodruff, et al. Optimal approximations of the frequency moments of data streams, 2005, STOC '05.

[5] Dimitris Achlioptas, et al. Database-friendly random projections: Johnson-Lindenstrauss with binary coins, 2003, J. Comput. Syst. Sci.

[6] J. A. Hartigan, et al. A k-means clustering algorithm, 1979.

[7] Ravi Kannan, et al. A New Probability Inequality Using Typical Moments and Concentration Results, 2008, 2009 50th Annual IEEE Symposium on Foundations of Computer Science.

[8] Amin Coja-Oghlan, et al. Graph Partitioning via Adaptive Spectral Techniques, 2009, Combinatorics, Probability and Computing.

[9] Sudipto Guha, et al. A constant-factor approximation algorithm for the k-median problem (extended abstract), 1999, STOC '99.

[10] Subhash Khot, et al. Near-optimal lower bounds on the multi-party communication complexity of set disjointness, 2003, 18th IEEE Annual Conference on Computational Complexity, 2003. Proceedings.

[11] Cynthia Dwork, et al. The Promise of Differential Privacy: A Tutorial on Algorithmic Techniques, 2011, 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.

[12] John Langford, et al. Scaling up machine learning: parallel and distributed approaches, 2011, KDD '11 Tutorials.

[13] Santosh S. Vempala, et al. Spectral Algorithms, 2009, Found. Trends Theor. Comput. Sci.

[14] Maria-Florina Balcan, et al. Distributed Learning, Communication Complexity and Privacy, 2012, COLT.

[15] Tamás Sarlós, et al. Improved Approximation Algorithms for Large Matrices via Random Projections, 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[16] Ke Chen, et al. On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications, 2009, SIAM J. Comput.

[17] Santosh S. Vempala, et al. An algorithmic theory of learning: Robust concepts and random projection, 1999, Machine Learning.

[18] Santosh S. Vempala, et al. Adaptive Sampling and Fast Low-Rank Matrix Approximation, 2006, APPROX-RANDOM.

[19] Noga Alon, et al. The Space Complexity of Approximating the Frequency Moments, 1999.

[20] Shanmugavelayutham Muthukrishnan, et al. Data Stream Algorithms, 2005.

[21] Avishek Saha, et al. Protocols for Learning Classifiers on Distributed Data, 2012, AISTATS.

[22] Santosh S. Vempala, et al. Matrix approximation and projective clustering via volume sampling, 2006, SODA '06.

[23] Amit Kumar, et al. Clustering with Spectral Norm and the k-Means Algorithm, 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[24] Avishek Saha, et al. Efficient Protocols for Distributed Classification and Optimization, 2012, ALT.

[25] Christos Boutsidis, et al. Near Optimal Column-Based Matrix Reconstruction, 2011, 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.

[26] Ravi Kumar, et al. An information statistics approach to data stream and communication complexity, 2004, J. Comput. Syst. Sci.

[27] Sergei Vassilvitskii, et al. k-means++: the advantages of careful seeding, 2007, SODA '07.

[28] David P. Woodruff, et al. Numerical linear algebra in the streaming model, 2009, STOC '09.