Adapting Community Detection Algorithms for Disease Module Identification in Heterogeneous Biological Networks

Biological networks catalog the complex web of interactions between molecules, typically proteins, within a cell. These networks are known to be highly modular, with groups of proteins associated with specific biological functions. Human diseases often arise from the dysfunction of one or more proteins in such a functional group. The ability to identify and automatically extract these modules has implications for understanding the etiology of different diseases as well as the functional roles of different protein modules in disease. The recent DREAM challenge posed the problem of identifying disease modules from six heterogeneous networks of proteins/genes. Many community detection algorithms exist, but not all of them are adaptable to the biological context, as these networks are densely connected and biologically relevant modules are quite small. The contribution of this study is threefold. First, we present a comprehensive assessment of many classic community detection algorithms for identifying non-overlapping communities in biological networks, and propose heuristics to identify small, structurally well-defined communities, which we call core modules. We evaluated performance over 180 GWAS datasets. Compared with traditional approaches, our proposed approach identified 50% more disease-relevant modules, which shows that identifying more compact modules is important for better performance. Second, we sought to understand the peculiar characteristics of disease-enriched modules and why standard community detection algorithms detect so few of them. We performed a comprehensive analysis of the interaction patterns of known disease genes to understand the structure of disease modules, and show that merely treating the set of known disease genes as a module does not give good-quality clusters, as measured by typical metrics such as modularity and conductance. We then present a methodology that leverages these known disease genes and also includes their neighboring nodes in a module, forming good-quality clusters from which we extract a “gold-standard set” of disease modules. Lastly, we demonstrate, with justification, that “overlapping” community detection algorithms should be the preferred choice for disease module identification, since several genes participate in multiple biological functions.
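
The effect of adding neighboring nodes on cluster quality can be illustrated with a small sketch. The toy graph, the node names, and the helper functions below are hypothetical and are not taken from the study. One-hop expansion is only the simplest reading of "including the neighboring nodes", and the actual methodology may select neighbors more carefully. Conductance is computed here as the number of edges leaving the module divided by the smaller of the two side volumes, so lower values indicate a better-separated module.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def conductance(adj, module):
    """Cut size divided by the smaller volume (sum of degrees) of the two sides."""
    module = set(module)
    cut = sum(1 for u in module for v in adj.get(u, ()) if v not in module)
    vol_in = sum(len(adj.get(u, ())) for u in module)
    vol_out = sum(len(nbrs) for u, nbrs in adj.items() if u not in module)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else float("inf")


def expand_with_neighbors(adj, seeds):
    """Return the seed set together with all one-hop neighbors (an assumption)."""
    expanded = set(seeds)
    for u in seeds:
        expanded |= adj.get(u, set())
    return expanded


# Hypothetical toy network: seed "disease genes" a and b are not directly
# linked, but they share the neighbors c and d.
edges = [
    ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
    ("d", "e"),
    ("e", "f"), ("f", "g"), ("g", "h"), ("h", "e"), ("e", "g"),
]
adj = build_adjacency(edges)

seed_genes = {"a", "b"}
expanded_module = expand_with_neighbors(adj, seed_genes)

seed_score = conductance(adj, seed_genes)
expanded_score = conductance(adj, expanded_module)
print(sorted(expanded_module), seed_score, expanded_score)
```

In this toy case every edge touching the seed genes leaves the seed set, so the seed set alone scores poorly, while the expanded module has a single outgoing edge and a much lower conductance.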
