Consistent k-Clustering

The study of online algorithms and competitive analysis provides a solid foundation for studying the quality of irrevocable decision making when the data arrives in an online manner. While in some scenarios the decisions are indeed irrevocable, there are many practical situations where changing a previous decision is not impossible, but simply expensive. In this work we formalize this notion and introduce the consistent k-clustering problem. With points arriving online, the goal is to maintain a constant-factor approximate solution while minimizing the number of reclusterings necessary. We prove a lower bound, showing that Ω(k log n) changes are necessary in the worst case for a wide range of objective functions. On the positive side, we give an algorithm that needs only O(k log² n) changes to maintain a constant-competitive solution, an exponential improvement on the naive solution of reclustering at every time step. Finally, we show experimentally that our approach performs much better than the theoretical bound, with the number of changes growing approximately as O(log n).
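To make the quantity being minimized concrete, the following is a minimal sketch of how the number of changes could be tallied over a sequence of solutions. It assumes that a "change" at a time step is a center present in the new solution but absent from the previous one; the function name `consistency_cost` and the toy sequence of center sets are hypothetical illustrations, not taken from the paper.

```python
from typing import Hashable, Iterable, Sequence, Set


def consistency_cost(center_sets: Sequence[Iterable[Hashable]]) -> int:
    """Total number of center changes across consecutive solutions.

    Assumption: a change at step t is a center that appears in the
    solution at step t but not in the solution at step t-1. The first
    solution is counted in full, since it starts from no centers.
    """
    total = 0
    previous: Set[Hashable] = set()
    for centers in center_sets:
        current = set(centers)
        # Count only newly introduced centers at this step.
        total += len(current - previous)
        previous = current
    return total


# Hypothetical run with k = 2 over five arrivals; centers are named points.
solutions = [
    {"a"},
    {"a", "b"},
    {"a", "b"},
    {"a", "c"},
    {"a", "c"},
]
print(consistency_cost(solutions))  # prints 3
```

Under this counting, an algorithm that recomputes its centers from scratch at every step can pay up to k changes per arrival, which is the naive baseline the abstract compares against.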
