Paper Information - Efficient Clustering with Limited Distance Information

Efficient Clustering with Limited Distance Information

Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one-versus-all queries that, given a point s in S, return the distances between s and all other points. We show that, under a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our algorithm to cluster proteins by sequence similarity. This setting fits our model well because we can use a fast sequence database search program to query a sequence against an entire dataset. We conduct an empirical study showing that, even though we query only a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
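
The landmark idea in the abstract can be illustrated with a minimal sketch. This is not the authors' selection procedure: it swaps in a plain farthest-first traversal to pick k landmarks, and the function `one_versus_all` is a hypothetical stand-in for the query oracle (here it simply computes Euclidean distances, whereas the paper's setting would call a sequence database search). Only the k queried rows of distances are used to assign points to clusters.

```python
import math
import random

def one_versus_all(points, i):
    """Hypothetical query oracle: distances from points[i] to every point.
    Euclidean distance is an assumption made only for this illustration."""
    return [math.dist(points[i], p) for p in points]

def landmark_clustering(points, k, seed=0):
    """Pick k landmarks by farthest-first traversal, then assign each point
    to its nearest landmark. Issues exactly k one-versus-all queries."""
    rng = random.Random(seed)
    n = len(points)
    first = rng.randrange(n)
    landmarks = [first]
    rows = [one_versus_all(points, first)]  # one row of distances per landmark
    nearest = list(rows[0])                 # distance to the closest landmark so far
    while len(landmarks) < k:
        # Next landmark: the point farthest from all landmarks chosen so far.
        nxt = max(range(n), key=lambda j: nearest[j])
        landmarks.append(nxt)
        row = one_versus_all(points, nxt)
        rows.append(row)
        nearest = [min(a, b) for a, b in zip(nearest, row)]
    # Cluster label of point j = index of its nearest landmark.
    labels = [min(range(k), key=lambda c: rows[c][j]) for j in range(n)]
    return landmarks, labels

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(landmark_clustering(pts, k=2))
```

The sketch only conveys the query pattern (k queries, n distances each); it carries none of the accuracy guarantees stated in the abstract, which depend on the paper's own selection strategy and structural assumption.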

[1] Robert D. Finn, et al. The Pfam protein families database, 2004, Nucleic Acids Res.

[2] Teofilo F. Gonzalez, et al. Clustering to Minimize the Maximum Intercluster Distance, 1985, Theor. Comput. Sci.

[3] Shai Ben-David, et al. A framework for statistical clustering with constant time approximation algorithms for K-median and K-means clustering, 2007, Machine Learning.

[4] Nir Ailon, et al. Streaming k-means approximation, 2009, NIPS.

[5] Leonard Pitt, et al. Sublinear time approximate clustering, 2001, SODA '01.

[6] A. G. Murzin, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures, 1995, Journal of Molecular Biology.

[7] James A. Casbon, et al. Spectral clustering of protein sequences, 2006, Nucleic Acids Research.

[8] Maria-Florina Balcan, et al. Approximate clustering without the approximation, 2009, SODA.

[9] L. Holm, et al. The Pfam protein families database, 2005, Nucleic Acids Res.

[10] Philip M. Long, et al. Performance guarantees for hierarchical clustering, 2002, J. Comput. Syst. Sci.

[11] E. Myers, et al. Basic local alignment search tool, 1990, Journal of Molecular Biology.

[12] Mark Crovella, et al. Virtual landmarks for the internet, 2003, IMC '03.

[13] Sergei Vassilvitskii, et al. k-means++: the advantages of careful seeding, 2007, SODA '07.

[14] Artur Czumaj, et al. Sublinear-time approximation algorithms for clustering via random sampling, 2007, Random Struct. Algorithms.

[15] C. Chothia, et al. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships, 1998, Proceedings of the National Academy of Sciences of the United States of America.

[16] Avrim Blum, et al. Stability Yields a PTAS for k-Median and k-Means Clustering, 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.