Distributed Submodular Maximization

Many large-scale machine learning problems (clustering, non-parametric learning, kernel machines, etc.) require selecting a small yet representative subset from a large dataset. Such problems can often be reduced to maximizing a submodular set function subject to various constraints. Classical approaches to submodular optimization require centralized access to the full dataset, which is impractical for truly large-scale problems. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple two-stage protocol, GreeDi, that is easily implemented using MapReduce-style computations. We theoretically analyze our approach and show that, under certain natural conditions, performance close to that of the centralized approach can be achieved. We begin with monotone submodular maximization subject to a cardinality constraint, and then extend this approach to obtain approximation guarantees for (not necessarily monotone) submodular maximization subject to more general constraints, including matroid and knapsack constraints. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar-based clustering on tens of millions of examples using Hadoop.
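The abstract does not spell out the protocol, so the following is only a minimal single-process sketch of what a two-stage greedy scheme of this kind might look like for a monotone submodular function under a cardinality constraint. The function names (`greedy`, `greedi`, `coverage`), the round-robin partitioning, the toy set-cover data, and the final "keep the best of the local and merged solutions" step are illustrative assumptions, not the authors' implementation; in a real MapReduce job the first stage would run as parallel map tasks and the second as a single reduce task.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Set

Element = Hashable
SetFunction = Callable[[List[Element]], float]


def greedy(f: SetFunction, candidates: Sequence[Element], k: int) -> List[Element]:
    """Standard greedy: repeatedly add the element with the largest marginal gain."""
    selected: List[Element] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        base = f(selected)
        best = max(remaining, key=lambda e: f(selected + [e]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


def greedi(f: SetFunction, ground_set: Sequence[Element], k: int, m: int) -> List[Element]:
    """Two-stage sketch: local greedy on m partitions, then greedy on the merged picks."""
    items = list(ground_set)
    # Assumption: a simple round-robin split stands in for the data partitioning.
    partitions = [items[i::m] for i in range(m)]
    # Stage 1 ("map"): each machine selects up to k elements from its own partition.
    local_solutions = [greedy(f, part, k) for part in partitions]
    # Stage 2 ("reduce"): merge the local picks (dropping duplicates) and run greedy again.
    merged = list(dict.fromkeys(e for sol in local_solutions for e in sol))
    merged_solution = greedy(f, merged, k)
    # Assumption: return whichever candidate solution scores highest under f.
    return max(local_solutions + [merged_solution], key=f)


# Hypothetical toy instance: maximum coverage, a monotone submodular objective.
SETS: Dict[int, Set[int]] = {
    0: {1, 2, 3},
    1: {3, 4},
    2: {4, 5, 6, 7},
    3: {1},
    4: {7, 8},
    5: {2, 8},
}


def coverage(selected: List[int]) -> int:
    """Number of distinct items covered by the chosen sets."""
    covered: Set[int] = set()
    for i in selected:
        covered |= SETS[i]
    return len(covered)


if __name__ == "__main__":
    solution = greedi(coverage, sorted(SETS), k=2, m=2)
    print(sorted(solution), coverage(solution))
```

In this sketch only the at most `m * k` locally selected elements reach the second stage, which is what makes the scheme attractive when the full dataset cannot be held on one machine.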
