Paper Information - Active Clustering of Biological Sequences

Active Clustering of Biological Sequences

Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model, we assume access to one-versus-all queries that, given a point s ∈ S, return the distances between s and all other points. We show that, under a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points, which we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting fits our model well because a fast sequence database search program can query a sequence against an entire data set. We conduct an empirical study showing that, even though we query only a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
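To make the query model concrete, here is a minimal Python sketch of landmark-based clustering with one-versus-all queries. It is an illustration under stated assumptions, not the procedure from the paper: the abstract does not spell out the active selection strategy, so the sketch substitutes plain farthest-first traversal, and it builds clusters by assigning each point to its nearest landmark. The names select_landmarks, assign_to_nearest, and one_vs_all are hypothetical; one_vs_all stands in for whatever returns the distances from one point to all others (for example, a wrapper around a sequence database search).

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical query type: given the index of a point s, return the
# distances from s to every point in the data set (including s itself).
OneVsAll = Callable[[int], Sequence[float]]


def select_landmarks(
    n: int, num_landmarks: int, one_vs_all: OneVsAll, seed: int = 0
) -> Tuple[List[int], List[List[float]]]:
    """Pick landmarks by farthest-first traversal (an assumed strategy).

    Issues exactly one one-versus-all query per landmark and returns the
    landmark indices together with their distance rows.
    """
    rng = random.Random(seed)
    first = rng.randrange(n)
    landmarks = [first]
    rows = [list(one_vs_all(first))]
    # min_dist[i] is the distance from point i to its closest landmark so far.
    min_dist = list(rows[0])
    while len(landmarks) < num_landmarks:
        # The next landmark is the point farthest from all chosen landmarks.
        nxt = max(range(n), key=lambda i: min_dist[i])
        landmarks.append(nxt)
        row = list(one_vs_all(nxt))
        rows.append(row)
        min_dist = [min(a, b) for a, b in zip(min_dist, row)]
    return landmarks, rows


def assign_to_nearest(rows: List[List[float]]) -> List[int]:
    """Label each point with the position of its nearest landmark."""
    n = len(rows[0])
    return [min(range(len(rows)), key=lambda j: rows[j][i]) for i in range(n)]


# Toy usage: points on a line, with the metric hidden behind the query.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]


def query(s: int) -> List[float]:
    return [abs(points[s] - p) for p in points]


landmarks, rows = select_landmarks(len(points), 2, query)
labels = assign_to_nearest(rows)
```

With k landmarks the sketch issues k one-versus-all queries, which matches the O(k) query budget stated in the abstract. It reads only the landmark rows of the distance matrix and never touches the remaining pairwise distances.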
