GW-SEM: A Statistical Package to Conduct Genome-Wide Structural Equation Modeling

Improving the accuracy of phenotyping through the use of advanced psychometric tools will increase the power to find significant associations with genetic variants and expand the range of hypotheses that can be tested on a genome-wide scale. Multivariate methods, such as structural equation modeling (SEM), are valuable in the phenotypic analysis of psychiatric and substance use phenotypes, but they have not been integrated into standard genome-wide association analyses because fitting a SEM at each single nucleotide polymorphism (SNP) along the genome has been considered too computationally demanding. A method that fits SEMs efficiently expands the set of models that can be tested. This is particularly necessary in psychiatric and behavioral genetics, where statistical methods are often handicapped by phenotypes with large components of stochastic variance. Because genome-wide scans produce enormous amounts of data, the statistical methods used to analyze them are relatively elementary: they do not correspond directly with the rich theoretical development in these fields, and they cannot test more complex hypotheses about the measurement of, and interaction between, comorbid traits. In this paper, we present a method to test the association of a SNP with multiple phenotypes or a latent construct on a genome-wide basis using a diagonally weighted least squares (DWLS) estimator for four common SEMs: a one-factor model, a one-factor residuals model, a two-factor model, and a latent growth model. We demonstrate that the DWLS parameters and p-values correspond closely with the more traditional full information maximum likelihood parameters and p-values. We also present the timing of simulations and power analyses, and a comparison with an existing multivariate GWAS software package.
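
As a rough illustration of the one-factor case, the sketch below writes out a generic one-factor model with a SNP predictor together with the textbook form of the DWLS fit function. The notation is assumed for illustration and is not taken from the paper: $y_{ij}$ is item $j$ for person $i$, $F_i$ the latent factor, $g_i$ the SNP genotype, $\lambda_j$ a factor loading, $\beta$ the SNP effect, $s$ the vector of sample statistics, $\sigma(\theta)$ its model-implied counterpart, and $\hat{\Gamma}$ the estimated asymptotic covariance matrix of $s$. The package's exact parameterization may differ.

```latex
% One-factor model with a SNP effect on the latent factor (assumed notation)
\begin{align}
  y_{ij} &= \lambda_j F_i + \varepsilon_{ij}, \qquad j = 1, \dots, p, \\
  F_i    &= \beta\, g_i + \zeta_i .
\end{align}

% Generic DWLS fit function: only the diagonal of \hat{\Gamma} is used as weights
\begin{equation}
  F_{\mathrm{DWLS}}(\theta)
    = \bigl(s - \sigma(\theta)\bigr)^{\top}
      \bigl[\operatorname{diag}(\hat{\Gamma})\bigr]^{-1}
      \bigl(s - \sigma(\theta)\bigr).
\end{equation}
```

Under this sketch, the association test at each SNP is a test of $H_0\colon \beta = 0$. Using only the diagonal of $\hat{\Gamma}$ avoids inverting the full weight matrix, which is one reason DWLS is cheaper per SNP than full weighted least squares.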
