Clustering and network analysis with biological applications

Clustering and network analysis are important areas of research in computer science and other disciplines. Clustering is broadly defined as finding sets of similar objects. Its applications include finding groups of similar buyers given their product preferences and finding groups of similar proteins given their sequences. Network analysis considers data represented by a collection of nodes (vertices) and the edges that link them. The structure of the network is studied to find central nodes, identify nodes that are similar to a particular node, and find well-connected groups of nodes. The World Wide Web and online social networks are among the best-studied networks today. Network analysis can also be applied to biological networks, where nodes are proteins and edges represent relationships or interactions between them.

The size of real-world data sets presents many challenges to the computational techniques that interpret them. A classic clustering problem is to divide a data set into groups given the pairwise distances between the objects. However, computing all pairwise distances may be infeasible if the data set is very large. In this thesis we consider clustering in a limited-information setting where we do not know the distances between the objects in advance and must instead query them during the execution of the algorithm. We present algorithms that find an accurate clustering in this setting using few queries.

The networks encountered in practice are also quite large, which makes computations on the entire network difficult. We present techniques for locally exploring networks that are efficient but still give meaningful information about the local structure of the graph. We develop several tools for locally exploring a network and show that they give meaningful results when applied to protein networks.
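To make the limited-information clustering setting concrete, the following is a minimal sketch in which distances are not available up front and each one must be requested from an oracle. It is not the thesis's algorithm: it uses the classic farthest-first traversal heuristic as a stand-in, and the names `CountingOracle` and `farthest_first_clustering`, as well as the toy 2-D points, are hypothetical and chosen only for illustration. The point of the sketch is the query count: selecting k centers this way asks for k * n distances instead of all n(n-1)/2 pairs.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


class CountingOracle:
    """Hypothetical distance oracle that counts how many distances are queried."""

    def __init__(self, points: Sequence[Point]) -> None:
        self._points = points
        self.queries = 0

    def distance(self, i: int, j: int) -> float:
        # Each call stands in for one expensive comparison
        # (for example, aligning two protein sequences).
        self.queries += 1
        return math.dist(self._points[i], self._points[j])


def farthest_first_clustering(
    n: int, k: int, oracle: CountingOracle, first: int = 0
) -> Tuple[List[int], List[int]]:
    """Pick k centers by farthest-first traversal, querying only
    center-to-point distances (k * n queries in total)."""
    centers = [first]
    nearest = [math.inf] * n   # distance from each point to its closest center
    labels = [0] * n           # index (into centers) of that closest center
    for c_idx in range(k):
        center = centers[c_idx]
        for j in range(n):
            d = oracle.distance(center, j)
            if d < nearest[j]:
                nearest[j] = d
                labels[j] = c_idx
        if c_idx + 1 < k:
            # The next center is the point farthest from all current centers.
            centers.append(max(range(n), key=nearest.__getitem__))
    return centers, labels


# Toy data (an assumption for illustration): two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
oracle = CountingOracle(points)
centers, labels = farthest_first_clustering(len(points), k=2, oracle=oracle)
print(centers, labels, oracle.queries)
```

On a data set this small the saving is negligible, but because the number of queries grows linearly in n for fixed k, the gap relative to computing all pairwise distances widens quickly as the data set grows.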
