Clique-Finding for Heterogeneity and Multidimensionality in Biomarker Epidemiology Research: The CHAMBER Algorithm

Background: The etiology of commonly occurring diseases may involve complex combinations of genes and exposures, resulting in etiologic heterogeneity. We present a computational algorithm that employs clique-finding for heterogeneity and multidimensionality in biomedical and epidemiological research (the “CHAMBER” algorithm).

Methodology/Principal Findings: This algorithm uses graph-building to (1) identify genetic variants that influence disease risk and (2) predict individuals at risk for disease based on inherited genotype. We use a set-covering algorithm to identify optimal cliques and a Boolean function that identifies etiologically heterogeneous groups of individuals. We evaluated this approach using simulated case-control genotype-disease associations involving two- and four-gene patterns. The CHAMBER algorithm correctly identified these simulated etiologies. We also applied it to two population-based case-control studies of breast and endometrial cancer in African American and Caucasian women, using data on genotypes involved in steroid hormone metabolism. We identified novel patterns in both cancer sites that involved genes that sulfate or glucuronidate estrogens or catecholestrogens. These associations were consistent with the hypothesized biological functions of these genes. We also identified cliques representing the joint effect of multiple candidate genes in all groups, suggesting the existence of biologically plausible combinations of hormone metabolism genes in both breast and endometrial cancer in both races.

Conclusions: The CHAMBER algorithm may have utility in exploring the multifactorial etiology and etiologic heterogeneity of complex disease.
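The abstract names the main ingredients (a genotype graph, clique-finding, set covering, and a Boolean risk function) without giving their details. The following is only a minimal sketch of how such a pipeline could fit together, not the published CHAMBER implementation. Everything specific in it is an assumption made for illustration: the edge rule (a genotype pair must occur in at least `min_cases` cases and in more cases than controls), the use of Bron-Kerbosch enumeration for maximal cliques, a greedy heuristic standing in for the set-covering step, and all function names and toy data.

```python
from itertools import combinations

def build_graph(people, labels, min_cases=2):
    """Nodes are (gene, genotype) pairs. ASSUMED edge rule: two nodes are joined
    when they co-occur in >= min_cases cases and in more cases than controls."""
    counts = {}  # (node_a, node_b) -> [n_controls, n_cases]
    for person, is_case in zip(people, labels):
        for a, b in combinations(sorted(person.items()), 2):
            counts.setdefault((a, b), [0, 0])[int(is_case)] += 1
    graph = {}
    for (a, b), (n_ctrl, n_case) in counts.items():
        if n_case >= min_cases and n_case > n_ctrl:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph

def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            if r:
                cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(graph), set())
    return cliques

def carriers(clique, people):
    """Indices of individuals who carry every genotype in the clique."""
    return {i for i, person in enumerate(people)
            if all(person.get(gene) == gt for gene, gt in clique)}

def greedy_cover(cliques, people, labels):
    """Greedy set cover of the cases (a heuristic stand-in, not an exact optimum)."""
    uncovered = {i for i, is_case in enumerate(labels) if is_case}
    chosen = []
    while uncovered and cliques:
        best = max(cliques, key=lambda c: len(carriers(c, people) & uncovered))
        if not carriers(best, people) & uncovered:
            break  # no remaining clique covers any uncovered case
        chosen.append(best)
        uncovered -= carriers(best, people)
    return chosen

def at_risk(person, chosen):
    """Boolean function: OR over cliques of AND over the genotypes in each clique."""
    return any(all(person.get(g) == gt for g, gt in c) for c in chosen)

# Toy data with two simulated etiologies: (A=aa AND B=bb) or (C=cc AND D=dd).
people = [
    {"A": "aa", "B": "bb", "C": "CC", "D": "DD"},  # case
    {"A": "aa", "B": "bb", "C": "Cc", "D": "Dd"},  # case
    {"A": "AA", "B": "BB", "C": "cc", "D": "dd"},  # case
    {"A": "Aa", "B": "Bb", "C": "cc", "D": "dd"},  # case
    {"A": "AA", "B": "BB", "C": "CC", "D": "DD"},  # control
    {"A": "Aa", "B": "Bb", "C": "Cc", "D": "Dd"},  # control
    {"A": "aa", "B": "BB", "C": "CC", "D": "dd"},  # control
    {"A": "AA", "B": "bb", "C": "cc", "D": "DD"},  # control
]
labels = [True] * 4 + [False] * 4

graph = build_graph(people, labels)
chosen = greedy_cover(maximal_cliques(graph), people, labels)
predictions = [at_risk(person, chosen) for person in people]
print(predictions == labels)  # True
```

On this toy data the two selected cliques are the two simulated two-gene patterns, and each covers a different subset of cases, which is one way to picture etiologic heterogeneity: the disjunction of cliques assigns individuals to distinct genotype-defined risk groups. A real analysis would need a statistically justified edge criterion and validation on independent data.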
