Network-Based Methods to Identify Highly Discriminating Subsets of Biomarkers

Complex diseases such as diabetes and many types of cancer are thought to be triggered and influenced by a combination of genetic and environmental factors. To account for the interplay among candidate factors, we propose a new network-based framework that identifies effective biomarkers by searching for groups of synergistic risk factors with high predictive power for disease outcome. We construct an interaction network in which node weights represent the individual predictive power of candidate factors and edge weights capture pairwise synergistic interactions among them. We then formulate network-based biomarker identification as a novel graph optimization model that searches for multiple cliques with maximum overall weight, which we call the Maximum Weighted Multiple Clique Problem (MWMCP). To obtain optimal or near-optimal solutions, we derive both an analytical algorithm based on column generation and a fast heuristic for large-scale networks. We applied our MWMCP algorithms to two biomedical data sets: a Type 1 Diabetes (T1D) data set from the Diabetes Prevention Trial-Type 1 (DPT-1) study, and a breast cancer genomics data set for metastasis prognosis. The results demonstrate that our network-based methods can identify important biomarkers with better prediction accuracy than conventional feature selection, which considers only individual effects.
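
To make the weighted-network formulation concrete, the following is a minimal illustrative sketch, not the column generation algorithm or the heuristic derived in the paper. It assumes that a clique's weight is the sum of its node weights plus the weights of all edges inside it, that the selected cliques are vertex-disjoint, and that the number of cliques `k` is given; none of these details are stated in the abstract. The function names (`clique_weight`, `greedy_multi_clique`) and the toy weights are hypothetical.

```python
from itertools import combinations


def clique_weight(clique, node_w, edge_w):
    """Assumed objective: node weights plus weights of all edges inside the clique."""
    total = sum(node_w[v] for v in clique)
    total += sum(edge_w[frozenset(pair)] for pair in combinations(clique, 2))
    return total


def greedy_multi_clique(node_w, edge_w, k):
    """Greedily grow up to k vertex-disjoint cliques (illustrative heuristic only)."""
    unused = set(node_w)
    cliques = []
    for _ in range(k):
        if not unused:
            break
        # Seed each clique with the heaviest unused node (ties broken by name).
        seed = max(sorted(unused), key=lambda v: node_w[v])
        clique = [seed]
        unused.remove(seed)
        while True:
            best, best_gain = None, 0.0
            for v in sorted(unused):
                # v must be adjacent to every current member to keep a clique.
                if all(frozenset((v, u)) in edge_w for u in clique):
                    gain = node_w[v] + sum(edge_w[frozenset((v, u))] for u in clique)
                    if gain > best_gain:
                        best, best_gain = v, gain
            if best is None:  # no candidate with positive marginal gain
                break
            clique.append(best)
            unused.remove(best)
        cliques.append(clique)
    return cliques


# Toy network with made-up weights: nodes are candidate factors,
# edges are pairwise synergy scores.
node_w = {"a": 0.6, "b": 0.5, "c": 0.4, "d": 0.3, "e": 0.2}
edge_w = {
    frozenset(("a", "b")): 0.3,
    frozenset(("a", "c")): 0.2,
    frozenset(("b", "c")): 0.1,
    frozenset(("c", "d")): 0.05,
    frozenset(("d", "e")): 0.4,
}

cliques = greedy_multi_clique(node_w, edge_w, k=2)
print(cliques)  # [['a', 'b', 'c'], ['d', 'e']]
```

A greedy pass like this gives no optimality guarantee; it only illustrates how individual effects (node weights) and synergy (edge weights) jointly determine which groups of factors are selected.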
