Clustering with Internal Connectedness

In this paper we study the problem of clustering entities that are described by two types of data: attribute data and relationship data. Attribute data describe the inherent characteristics of the entities, whereas relationship data represent associations among them. Attribute data can be mapped to Euclidean space, but this is not always possible for relationship data. The relationship data are instead described by a graph whose vertices are the entities and whose edges denote a relationship between the pair of entities they connect. We study clustering under a model in which the relationship data impose an 'internal connectedness' constraint: any two entities in a cluster must be connected by an internal path, that is, a path that passes only through entities of the same cluster. We consider the k-median and k-means clustering problems under this model. We show that these problems are Ω(log n)-hard to approximate and give O(log n)-approximation algorithms for specific cases of these problems.
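The internal connectedness constraint and the k-median objective can be illustrated with a minimal sketch. This is not the paper's algorithm; it only checks whether a given clustering is feasible and evaluates its cost. The names `is_internally_connected`, `k_median_cost`, `adjacency`, and `points` are hypothetical, and the sketch assumes that attribute data are given as Euclidean coordinates and that one center is supplied per cluster.

```python
import math
from collections import deque


def is_internally_connected(cluster, adjacency):
    """Return True if every pair of entities in `cluster` is joined by a
    path that uses only entities from `cluster` (hypothetical helper)."""
    members = set(cluster)
    if not members:
        return True
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    # Breadth-first search restricted to the cluster's own vertices.
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v in members and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == members


def k_median_cost(clusters, centers, points):
    """Sum of Euclidean distances from each entity to its cluster's center
    (assumes one center per cluster, given as coordinates)."""
    total = 0.0
    for cluster, center in zip(clusters, centers):
        for entity in cluster:
            total += math.dist(points[entity], center)
    return total


# Illustrative data: four entities on a line, related along the path a-b-c-d.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
points = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0), "d": (3.0, 0.0)}

feasible = [["a", "b"], ["c", "d"]]    # each cluster induces a connected subgraph
infeasible = [["a", "c"], ["b", "d"]]  # a and c are linked only through b

print(all(is_internally_connected(c, adjacency) for c in feasible))
print(all(is_internally_connected(c, adjacency) for c in infeasible))
print(k_median_cost(feasible, [points["a"], points["c"]], points))
```

In this toy instance the clustering {a, b}, {c, d} satisfies the constraint, while {a, c}, {b, d} does not, because the only path from a to c passes through b, which lies in a different cluster.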
