Optimal Distributed Submodular Optimization via Sketching

We present distributed algorithms for several classes of submodular optimization problems, including k-cover, set cover, facility location, and probabilistic coverage. The new algorithms achieve almost optimal space complexity, optimal approximation guarantees, and optimal communication complexity, and they run in only four rounds of computation, addressing major shortcomings of prior work. We first present a distributed algorithm for k-cover using only Õ(n) space per machine, and then extend it to several submodular optimization problems, improving previous results for all of the above problems; for example, our algorithm for the facility location problem improves the space of the best-known algorithm (Lindgren et al.). Our algorithms are implementable in various distributed frameworks such as MapReduce and RAM models. On the hardness side, we demonstrate the limitations of uniform sampling via an information-theoretic argument. Furthermore, we perform an extensive empirical study of our algorithms (implemented in MapReduce) on a variety of datasets. We observe that, using sketches 30-600 times smaller than the input, one can solve the coverage maximization problem with quality very close to that of the state-of-the-art single-machine algorithm. Finally, we demonstrate an application of our algorithm in large-scale feature selection.
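To make the idea of solving coverage maximization on a smaller sketch concrete, the following is a minimal single-machine illustration in Python. It is not the algorithm of this paper: it assumes plain hash-based subsampling of elements (a simple form of uniform sampling, whose limitations the abstract itself points out) followed by the standard greedy heuristic for k-cover. All function names and the `keep_fraction` parameter are hypothetical and chosen only for illustration.

```python
import hashlib


def element_hash(element, seed=0):
    """Map an element to a pseudo-uniform value in [0, 1).

    Hashing (rather than drawing random numbers) means every set keeps or
    drops a given element consistently. This is an illustrative choice.
    """
    digest = hashlib.sha256(f"{seed}:{element}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sketch(sets, keep_fraction, seed=0):
    """Keep only the elements whose hash value falls below keep_fraction."""
    return {
        name: {e for e in elems if element_hash(e, seed) < keep_fraction}
        for name, elems in sets.items()
    }


def greedy_k_cover(sets, k):
    """Standard greedy: repeatedly pick the set covering the most new elements."""
    covered, chosen = set(), []
    for _ in range(min(k, len(sets))):
        best = max(
            (name for name in sets if name not in chosen),
            key=lambda name: len(sets[name] - covered),
        )
        chosen.append(best)
        covered |= sets[best]
    return chosen


def coverage(sets, chosen):
    """Number of distinct elements covered by the chosen sets."""
    return len(set().union(*(sets[name] for name in chosen))) if chosen else 0


# Toy instance (made-up data): choose sets on the sketch, evaluate on the input.
sets = {"A": set(range(0, 60)), "B": set(range(50, 100)), "C": set(range(0, 10))}
chosen_on_sketch = greedy_k_cover(sketch(sets, keep_fraction=0.5), k=2)
print(chosen_on_sketch, coverage(sets, chosen_on_sketch))
```

The selection step only ever sees the subsampled sets, while the quality of the chosen sets is measured on the original input. The sketch construction and guarantees described in the paper are not reproduced here.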
