Active Clustering of Biological Sequences

Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume access to one-versus-all queries, which, given a point s ∈ S, return the distances between s and all other points. We show that, under a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and it considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting fits our model well because a fast sequence database search program can query a sequence against an entire data set. We conduct an empirical study showing that, even though we query only a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
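To make the query model concrete, the following is a minimal sketch of landmark-based clustering with one-versus-all queries. It is not the procedure from the paper: it assumes a simple farthest-first traversal for landmark selection and assigns each point to its nearest landmark. The names `select_landmarks`, `assign_to_nearest`, and `query`, as well as the toy one-dimensional data, are hypothetical and chosen only for illustration. In the protein setting, `query` would stand in for a single sequence database search.

```python
from typing import Callable, List, Sequence, Tuple

# A one-versus-all query: given the index of a point s, return the
# distances from s to every point in the set (hypothetical interface).
Query = Callable[[int], Sequence[float]]


def select_landmarks(
    n: int, query: Query, num_landmarks: int, first: int = 0
) -> Tuple[List[int], List[Sequence[float]]]:
    """Pick landmarks by farthest-first traversal (an assumed strategy).

    Issues exactly one one-versus-all query per landmark.
    """
    landmarks = [first]
    rows = [query(first)]
    # min_dist[i] = distance from point i to its closest landmark so far.
    min_dist = list(rows[0])
    while len(landmarks) < num_landmarks:
        # The next landmark is the point farthest from all current landmarks.
        nxt = max(range(n), key=lambda i: min_dist[i])
        row = query(nxt)
        landmarks.append(nxt)
        rows.append(row)
        min_dist = [min(a, b) for a, b in zip(min_dist, row)]
    return landmarks, rows


def assign_to_nearest(rows: List[Sequence[float]]) -> List[int]:
    """Label each point with the position of its nearest landmark."""
    n = len(rows[0])
    return [min(range(len(rows)), key=lambda j: rows[j][i]) for i in range(n)]


# Toy data: three well-separated groups on a line (illustrative only).
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0]
calls: List[int] = []  # records which points were queried


def query(s: int) -> List[float]:
    calls.append(s)
    return [abs(points[s] - p) for p in points]


landmarks, rows = select_landmarks(len(points), query, num_landmarks=3)
labels = assign_to_nearest(rows)
print(landmarks, labels, len(calls))
```

In this sketch the number of one-versus-all queries equals the number of landmarks, so the clustering step never touches distances between two non-landmark points.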
