Memory-efficient Query-driven Community Detection with Application to Complex Disease Associations

Community detection in real-world graphs presents a number of challenges. First, even if the number of detected communities grows only linearly with the graph size, it becomes impossible to manually inspect each community for the value it adds to the application knowledge base. Mining for communities with query nodes as knowledge priors could filter out irrelevant information and enrich end users' knowledge of the problem of interest, such as the discovery of genes functionally associated with Alzheimer's disease (AD) biomarker genes. Second, the data-intensive nature of community enumeration challenges current approaches, which often assume that the input graph and the detected communities fit in memory. As computer systems scale, DRAM sizes are not expected to increase linearly, while technologies such as SSDs have the potential to provide much higher capacities at a lower power-cost point, with much lower latency than disks. Out-of-core algorithms and/or database-inspired indexing could open up different design optimizations for query-driven community detection algorithms tuned for emerging architectures. This work therefore addresses the need for query-driven and memory-efficient community detection. Using maximal cliques as the community definition, due to their high signal-to-noise ratio, we propose and systematically compare two contrasting methods: index-based and out-of-core. Both methods reduce peak memory usage by as much as 1000-fold compared to the state of the art. However, the index-based method, which also achieves a 10-to-100-fold reduction in run time, outperforms the out-of-core algorithm in most cases. The achieved scalability enables the discovery of diseases that are known to be, or are likely to be, associated with Alzheimer's when the genome-scale network is mined with AD biomarker genes as knowledge priors.
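
To make the notion of query-driven maximal clique enumeration concrete, the following is a minimal in-memory sketch, not the index-based or out-of-core method proposed in this work. It assumes a Bron-Kerbosch recursion with pivoting, seeded so that only maximal cliques containing a given query node are reported. The function name `maximal_cliques_containing`, the adjacency-set graph representation, and the toy graph are all hypothetical and chosen for illustration.

```python
from typing import Dict, FrozenSet, Iterator, Set

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets (assumed representation)


def maximal_cliques_containing(graph: Graph, query: str) -> Iterator[FrozenSet[str]]:
    """Yield every maximal clique of `graph` that contains the node `query`.

    Illustrative Bron-Kerbosch recursion with pivoting. Seeding the search
    with R = {query} and P = neighbors(query) restricts enumeration to
    cliques that include the query node.
    """

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> Iterator[FrozenSet[str]]:
        if not p and not x:
            # No candidate can extend r and no excluded node covers it: r is maximal.
            yield frozenset(r)
            return
        # Choose the pivot with the most neighbors among the candidates,
        # which reduces the number of recursive branches.
        pivot = max(p | x, key=lambda v: len(graph[v] & p))
        for v in list(p - graph[pivot]):
            yield from expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}  # v has been fully explored
            x = x | {v}  # exclude v from later branches to avoid duplicates

    yield from expand({query}, set(graph[query]), set())


# Toy graph (hypothetical): a triangle q-a-b, an edge q-c, and an edge c-d.
toy: Graph = {
    "q": {"a", "b", "c"},
    "a": {"q", "b"},
    "b": {"q", "a"},
    "c": {"q", "d"},
    "d": {"c"},
}

if __name__ == "__main__":
    for clique in maximal_cliques_containing(toy, "q"):
        print(sorted(clique))
```

In the toy graph, the clique {c, d} is maximal but is never reported because it does not contain the query node, which is the filtering effect that query nodes used as knowledge priors are meant to provide. This sketch keeps the whole graph and recursion state in memory, so it does not address the memory constraints that motivate the two proposed methods.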
