Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

Many large-scale machine learning problems (such as clustering, non-parametric learning, and kernel machines) require selecting a manageable yet representative subset out of a massive data set. Such problems can often be reduced to maximizing a submodular set function subject to cardinality constraints. Classical approaches require centralized access to the full data set, but for truly large-scale problems, gathering the data in one place is often impractical. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple two-stage protocol, GREEDI, that is easily implemented using MapReduce-style computations. We theoretically analyze our approach and show that, under certain natural conditions, performance close to that of the (impractical) centralized approach can be achieved. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar-based clustering, on tens of millions of data points using Hadoop.
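
The abstract does not spell out the two stages, so the following Python sketch is only one plausible reading of such a protocol, not the authors' implementation. It assumes that the data are split at random across m machines, that each machine runs the classical greedy algorithm locally (a "map" step), and that a second greedy pass over the merged local solutions (a "reduce" step) is compared against the best local solution. The names `greedy`, `two_stage_greedy`, and `coverage`, as well as the toy coverage objective, are illustrative assumptions.

```python
import random
from typing import Callable, Dict, Hashable, List, Sequence, Set

SetFunction = Callable[[Set[Hashable]], float]


def greedy(f: SetFunction, candidates: Sequence[Hashable], k: int) -> List[Hashable]:
    """Classical greedy: repeatedly add the element with the largest marginal gain."""
    selected: List[Hashable] = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        current = set(selected)
        base = f(current)
        best = max(remaining, key=lambda e: f(current | {e}) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


def two_stage_greedy(f: SetFunction, ground: Sequence[Hashable], k: int,
                     m: int, seed: int = 0) -> List[Hashable]:
    """Hypothetical two-stage protocol (assumed structure, see lead-in)."""
    items = list(ground)
    random.Random(seed).shuffle(items)          # random partition of the data
    parts = [items[i::m] for i in range(m)]     # one part per simulated machine

    # Stage 1 ("map"): each machine solves its own problem with greedy.
    local = [greedy(f, part, k) for part in parts]

    # Stage 2 ("reduce"): greedy over the union of the local solutions.
    merged = [e for solution in local for e in solution]
    final = greedy(f, merged, k)

    # Assumption: return whichever candidate solution scores highest.
    return max(local + [final], key=lambda s: f(set(s)))


def coverage(sets: Dict[Hashable, Set[int]]) -> SetFunction:
    """Toy submodular objective: number of points covered by the chosen sets."""
    def f(chosen: Set[Hashable]) -> float:
        covered: Set[int] = set()
        for e in chosen:
            covered |= sets[e]
        return float(len(covered))
    return f
```

In this sketch the machines are simulated by a list comprehension; in a MapReduce-style deployment, stage 1 would run in parallel on separate workers and only the small local solutions would be sent to the reducer.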
