Causal Genetic Inference Using Haplotypes as Instrumental Variables

In genomic studies where both genotypes and gene or protein expression profiles are available, the causal effects of a gene or protein on clinical outcomes can be inferred by using genetic variants as instrumental variables (IVs). The goal of introducing an IV is to remove the effects of unobserved factors that may confound the relationship between the biomarkers and the outcome. Valid inference under the IV framework requires pairwise associations and pathway exclusivity. Among these assumptions, the IV‐expression association must be strong for the causal effect estimates to be unbiased. However, a small number of single nucleotide polymorphisms (SNPs) often explains little of the variability in gene or protein expression and can serve only as weak IVs. In this study, we propose replacing SNPs with haplotypes as IVs to strengthen the variant‐expression association and thus improve causal effect inference for the expression. Within the classical two‐stage procedure, we developed a haplotype regression model combined with a model selection procedure to identify optimal instruments. The performance of the new method was evaluated through simulations and compared with IV approaches that use multiple observed SNPs. Our results showed a gain in power to detect a causal effect of a gene or protein on the outcome when using haplotypes rather than only observed SNPs, under both complete and missing genotype scenarios. We applied the proposed method to a study of the effect of interleukin‐1 beta (IL‐1β) protein expression on 90‐day survival following sepsis and found that overexpressed IL‐1β is likely to increase mortality.
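
The two‐stage idea mentioned above can be illustrated with a minimal sketch. This is not the haplotype regression or model selection procedure of the study; it is generic two‐stage least squares on simulated data, assuming a single instrument (the count of a hypothetical haplotype), a continuous outcome, and linear effects. All function names, effect sizes, and frequencies are illustrative assumptions.

```python
import numpy as np


def two_stage_least_squares(z, x, y):
    """Generic 2SLS estimate of the effect of x on y using instrument(s) z."""
    n = len(y)
    Z = np.column_stack([np.ones(n), z])  # instruments plus intercept
    # Stage 1: regress expression on the instruments and keep fitted values.
    gamma, *_ = np.linalg.lstsq(Z, x, rcond=None)
    x_hat = Z @ gamma
    # Stage 2: regress the outcome on the fitted expression.
    X_hat = np.column_stack([np.ones(n), x_hat])
    coef, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return coef[1]


def ols_slope(x, y):
    """Naive regression of y on x, ignoring confounding."""
    X = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


def simulate(n=20000, beta=0.5, seed=0):
    """Simulated data (assumed, not from the study): z is the instrument,
    u an unobserved confounder, x the expression, y the outcome."""
    rng = np.random.default_rng(seed)
    z = rng.binomial(2, 0.3, size=n).astype(float)  # copies of a hypothetical haplotype
    u = rng.normal(size=n)                          # unobserved confounder
    x = 1.0 * z + u + rng.normal(size=n)            # expression depends on z and u
    y = beta * x + u + rng.normal(size=n)           # outcome depends on x and u
    return z, x, y


if __name__ == "__main__":
    z, x, y = simulate()
    print("naive OLS slope:", ols_slope(x, y))
    print("2SLS estimate:  ", two_stage_least_squares(z, x, y))
```

In this simulation the naive regression is pulled away from the true effect by the confounder, while the two‐stage estimate stays close to it because the instrument is strong. With a weak instrument (a small coefficient on `z` in the expression model) the two‐stage estimate becomes unstable, which is the problem the haplotype‐based instruments are meant to address.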
