Distributed Submodular Cover: Succinctly Summarizing Massive Data

How can one find a subset, ideally as small as possible, that well represents a massive dataset? I.e., its corresponding utility, measured according to a suitable utility function, should be comparable to that of the whole dataset. In this paper, we formalize this challenge as a submodular cover problem. Here, the utility is assumed to exhibit submodularity, a natural diminishing returns condition prevalent in many data summarization applications. The classical greedy algorithm is known to provide solutions with logarithmic approximation guarantees compared to the optimum solution. However, this sequential, centralized approach is impractical for truly large-scale problems. In this work, we develop the first distributed algorithm - DISCOVER - for submodular set cover that is easily implementable using MapReduce-style computations. We theoretically analyze our approach, and present approximation guarantees for the solutions returned by DISCOVER. We also study a natural trade-off between the communication cost and the number of rounds required to obtain such a solution. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including active set selection, exemplar based clustering, and vertex cover on tens of millions of data points using Spark.

[1]  Ben Taskar,et al.  Determinantal Point Processes for Machine Learning , 2012, Found. Trends Mach. Learn..

[2]  Roberto Battiti,et al.  Using mutual information for selecting features in supervised neural net learning , 1994, IEEE Trans. Neural Networks.

[3]  Dafna Shahaf,et al.  Turning down the noise in the blogosphere , 2009, KDD.

[4]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[5]  Andreas Krause,et al.  Submodular Function Maximization , 2014, Tractability.

[6]  Andreas Krause,et al.  Adaptive Submodularity: Theory and Applications in Active Learning and Stochastic Optimization , 2010, J. Artif. Intell. Res..

[7]  Silvio Lattanzi,et al.  Filtering: a method for solving graph problems in MapReduce , 2011, SPAA '11.

[8]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[9]  Andreas Krause,et al.  Distributed Submodular Maximization: Identifying Representative Elements in Massive Data , 2013, NIPS.

[10]  Jure Leskovec,et al.  Defining and evaluating network communities based on ground-truth , 2012, KDD 2012.

[11]  Stergios Stergiou,et al.  Set Cover at Web Scale , 2015, KDD.

[12]  Carlos Guestrin,et al.  Beyond keyword search: discovering relevant scientific literature , 2011, KDD.

[13]  Guy E. Blelloch,et al.  Linear-work greedy parallel approximate set cover and variants , 2011, SPAA '11.

[14]  Laurence A. Wolsey,et al.  An analysis of the greedy algorithm for the submodular set covering problem , 1982, Comb..

[15]  Andreas Krause,et al.  Budgeted Nonparametric Learning from Data Streams , 2010, ICML.

[16]  Rishabh K. Iyer,et al.  Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints , 2013, NIPS.

[17]  Andreas Krause,et al.  Lazier Than Lazy Greedy , 2014, AAAI.

[18]  Christopher K. I. Williams,et al.  Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning) , 2005 .

[19]  Éva Tardos,et al.  Maximizing the Spread of Influence through a Social Network , 2015, Theory Comput..

[20]  Max A. Little,et al.  Enhanced classical dysphonia measures and sparse regression for telemonitoring of Parkinson's disease progression , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[21]  Sergei Vassilvitskii,et al.  Fast greedy algorithms in mapreduce and streaming , 2013, SPAA.

[22]  Bonnie Berger,et al.  Efficient NC Algorithms for Set Cover with Applications to Learning and Geometry , 1994, J. Comput. Syst. Sci..

[23]  Morteza Zadimoghaddam,et al.  Randomized Composable Core-sets for Distributed Submodular Maximization , 2015, STOC.

[24]  Antonio Torralba,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition , 2022 .

[25]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[26]  Ravi Kumar,et al.  Max-cover in map-reduce , 2010, WWW '10.

[27]  Andreas Krause,et al.  Streaming submodular maximization: massive data summarization on the fly , 2014, KDD.

[28]  Rishabh K. Iyer,et al.  Learning Mixtures of Submodular Functions for Image Collection Summarization , 2014, NIPS.

[29]  Huy L. Nguyen,et al.  The Power of Randomization: Distributed Submodular Maximization on Massive Datasets , 2015, ICML.