An ensemble framework for clustering protein-protein interaction networks

MOTIVATION: Protein-Protein Interaction (PPI) networks are believed to be important sources of information related to biological processes and complex metabolic functions of the cell. The presence of biologically relevant functional modules in these networks has been theorized by many researchers. However, the application of traditional clustering algorithms for extracting these modules has not been successful, largely due to the presence of noisy false positive interactions as well as specific topological challenges in the network.

RESULTS: In this article, we propose an ensemble clustering framework to address this problem. For base clustering, we introduce two topology-based distance metrics to counteract the effects of noise. We develop a PCA-based consensus clustering technique, designed to reduce the dimensionality of the consensus problem and yield informative clusters. We also develop a soft consensus clustering variant to assign multifaceted proteins to multiple functional groups. We conduct an empirical evaluation of different consensus techniques using topology-based, information theoretic and domain-specific validation metrics and show that our approaches can provide significant benefits over other state-of-the-art approaches. Our analysis of the consensus clusters obtained demonstrates that ensemble clustering can (a) produce improved biologically significant functional groupings; and (b) facilitate soft clustering by discovering multiple functional associations for proteins.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
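The abstract does not spell out the PCA-based consensus step, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes each base clustering is a list of integer labels, encodes all base clusterings as one binary membership matrix, projects the proteins onto the first principal component (found here by power iteration on the centered Gram matrix), and splits them by the sign of that score. The function names, the toy labels, and the sign-based split are all illustrative assumptions.

```python
import math
import random


def membership_matrix(base_clusterings):
    """One row per protein, one 0/1 column per (clustering, cluster) pair."""
    n = len(base_clusterings[0])
    rows = [[] for _ in range(n)]
    for labels in base_clusterings:
        for cluster_id in sorted(set(labels)):
            for i in range(n):
                rows[i].append(1.0 if labels[i] == cluster_id else 0.0)
    return rows


def first_pc_scores(rows, iterations=200, seed=0):
    """Scores on the first principal component, up to sign and scale."""
    n, d = len(rows), len(rows[0])
    # Center every column so that PCA measures variance around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # The top eigenvector of the Gram matrix X X^T is proportional to
    # the projection of each row onto the first principal component.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
            for ri in centered]
    rng = random.Random(seed)
    u = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):  # plain power iteration
        u = [sum(g * x for g, x in zip(row, u)) for row in gram]
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:  # degenerate input: all base clusterings are trivial
            break
        u = [x / norm for x in u]
    return u


def consensus_labels(base_clusterings):
    """Two-way consensus: split proteins by the sign of their PC1 score."""
    scores = first_pc_scores(membership_matrix(base_clusterings))
    return [0 if s < 0 else 1 for s in scores]


# Hypothetical toy input: three base clusterings of six proteins that
# disagree only about the two proteins in the middle.
base = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
labels = consensus_labels(base)
```

On this toy input the consensus places the first three proteins in one group and the last three in the other; which group is called 0 and which 1 is arbitrary, because the sign of a principal component is not determined. A realistic pipeline would keep several components and run a proper clustering algorithm in the reduced space instead of thresholding a single score.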
