Set Coverage Problems in a One-Pass Data Stream

Finding a maximum coverage by k sets from a given collection (Max-k-Cover) and finding a minimum number of sets with a required coverage (Partial-Cover) are both important combinatorial optimization problems. Various problems from data mining, machine learning, social network analysis, operations research, etc. can be formulated as set coverage problems. The standard greedy algorithm is efficient as an in-memory algorithm. However, when we face very large datasets or work in an online environment, we need an algorithm that makes only one pass through the entire dataset. Previous one-pass algorithms for the Max-k-Cover problem cannot be extended to the Partial-Cover problem and do not enjoy the prefix-optimal property. In this paper, we propose a novel one-pass streaming algorithm that produces a prefix-optimal ordering of sets, which can easily be used to solve both the Max-k-Cover and the Partial-Cover problems. Our algorithm consumes space linear in the size of the universe of elements, and the processing time for a set is linear in the size of that set. We also show with the aid of computer simulation that the approximation ratio of our algorithm on the Max-k-Cover problem is around 0.3. We conduct experiments on extensive datasets to compare our algorithm with existing one-pass algorithms on the Max-k-Cover problem, and with the standard greedy algorithm on the Partial-Cover problem. The results demonstrate the efficiency and quality of our algorithm.

Keywords: max-k-cover problem; one-pass stream; partial-cover problem
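To make the two problems concrete, the following is a minimal Python sketch of the standard in-memory greedy baseline mentioned above, not of the one-pass streaming algorithm proposed in the paper. It shows how a single ordering of sets can answer both a Max-k-Cover query (take the first k sets) and a Partial-Cover query (take the shortest prefix reaching the required coverage). The function names `greedy_order`, `max_k_cover`, and `partial_cover`, and the toy input, are illustrative assumptions.

```python
from typing import Dict, List, Optional, Set


def greedy_order(sets: Dict[str, Set[int]]) -> List[str]:
    """Order set ids by the standard greedy rule: largest marginal gain first.

    This is the in-memory baseline, which rescans all remaining sets at
    every step; it is NOT a one-pass streaming algorithm.
    """
    covered: Set[int] = set()
    remaining = dict(sets)
    order: List[str] = []
    while remaining:
        # Pick the set that adds the most not-yet-covered elements.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining set adds any new element
        covered |= remaining.pop(best)
        order.append(best)
    return order


def max_k_cover(sets: Dict[str, Set[int]], order: List[str], k: int) -> Set[int]:
    """Elements covered by the first k sets of the ordering."""
    covered: Set[int] = set()
    for s in order[:k]:
        covered |= sets[s]
    return covered


def partial_cover(
    sets: Dict[str, Set[int]], order: List[str], required: int
) -> Optional[List[str]]:
    """Shortest prefix of the ordering covering at least `required` elements."""
    covered: Set[int] = set()
    for i, s in enumerate(order, start=1):
        covered |= sets[s]
        if len(covered) >= required:
            return order[:i]
    return None  # the ordering never reaches the required coverage


# Hypothetical toy instance over the universe {1, ..., 6}.
example = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}, "D": {1, 6}}
example_order = greedy_order(example)
```

On this toy instance, `max_k_cover(example, example_order, 1)` and `partial_cover(example, example_order, 5)` both read their answers from prefixes of the same ordering, which is the usage pattern the abstract describes for the proposed algorithm's output.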
