Analysis of Medical Data Using Community Detection on Inferred Networks

Performing network-based analysis on medical and biological data makes a wide variety of machine learning tools available. Clustering, which can be used for classification, presents opportunities for identifying hard-to-reach groups for the development of customized health interventions. Many graph inference methods have been developed, driven by a desire to convert abundant DNA gene co-expression data into networks. Likewise, there are many clustering and classification tools. This paper compares techniques for graph inference and clustering, using different numbers of features, in order to select the best tuple of graph inference method, clustering method, and number of features for a particular phenotype. An extensive machine-learning-based analysis of the REGARDS dataset is conducted, evaluating the CoNet and K-Nearest Neighbors (KNN) network inference methods along with the Louvain, Leiden, and NBR-Clust clustering techniques. Results from an analysis involving five internal cluster evaluation indices show that the traditional KNN inference method and the NBR-Clust and Louvain clustering techniques produce the most promising clusters with medical phenotype data. It is also shown that visualization can aid in interpreting the clusters, and that the clusters produced can identify meaningful groups that suggest customized interventions.
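As an illustration of the kind of pipeline being compared, the following is a minimal sketch in plain Python, not the authors' implementation. It infers an undirected KNN graph from feature vectors and then scores a candidate partition with Newman modularity, the objective that Louvain and Leiden optimize. The function names `knn_graph` and `modularity`, the toy points, and the choice of k = 2 are all assumptions made for this example; the REGARDS data, CoNet, NBR-Clust, and the five internal indices used in the paper are not reproduced here.

```python
from math import dist

def knn_graph(points, k):
    """Infer an undirected KNN graph; edges are frozensets of node indices."""
    edges = set()
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to point i.
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        for j in others[:k]:
            # Symmetrize: keep the edge if either endpoint selects the other.
            edges.add(frozenset((i, j)))
    return edges

def modularity(edges, labels):
    """Newman modularity Q of a partition on an unweighted, undirected graph."""
    m = len(edges)
    intra = {}    # number of edges inside each cluster
    deg_sum = {}  # total degree of the nodes in each cluster
    for edge in edges:
        u, v = tuple(edge)
        for node in (u, v):
            deg_sum[labels[node]] = deg_sum.get(labels[node], 0) + 1
        if labels[u] == labels[v]:
            intra[labels[u]] = intra.get(labels[u], 0) + 1
    return sum(intra.get(c, 0) / m - (deg_sum[c] / (2 * m)) ** 2
               for c in deg_sum)

# Hypothetical toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
edges = knn_graph(points, k=2)

# Compare a two-cluster partition against putting everything in one cluster.
q_two_clusters = modularity(edges, [0, 0, 0, 0, 1, 1, 1, 1])
q_one_cluster = modularity(edges, [0] * 8)
```

In this toy case the two-cluster partition scores higher than the single-cluster one, which is the kind of signal a modularity-based method exploits. In practice, an established implementation of Louvain or Leiden would replace the hand-written partition, and internal indices computed on the original feature space would be used to compare the resulting clusterings.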
