Robust linear regression methods in association studies

MOTIVATION: It is well known that data deficiencies, such as coding or rounding errors, outliers or missing values, can produce misleading results in many statistical methods. Robust statistical methods are designed to accommodate certain types of these deficiencies and so give reliable results under a wider range of conditions. We analyze statistical tests for detecting associations between individual genomic variations (SNPs) and quantitative traits when deviations from the normality assumption are observed. We consider the classical analysis of variance tests for the parameters of the appropriate linear model and a robust version of those tests based on M-regression. We then compare their empirical power and level using simulated data with several degrees of contamination.

RESULTS: Data normality is nothing but a mathematical convenience. In practice, experiments usually yield data with non-conforming observations. With such data, classical least squares methods perform poorly: they give biased estimates, raise the number of spurious associations and often fail to detect true ones. We show, through a simulation study and a real data example, that the robust methodology can be more powerful, and thus more adequate for association studies, than the classical approach.

AVAILABILITY: The code of the robustified version of the function lmekin() from the R package kinship is provided as Supplementary Material.
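The contrast between least squares and M-regression described above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes a single SNP coded 0/1/2, a Huber loss with the conventional tuning constant 1.345, a MAD-based residual scale, and a hypothetical contamination scheme in which 20 genotype-0 individuals have their trait shifted by 8 units. All function and variable names are illustrative. It compares slope estimates only; the robust tests studied in the paper are not reproduced here.

```python
import random
import statistics


def weighted_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def huber_fit(x, y, c=1.345, iters=50):
    """Huber M-estimate of (a, b) by iteratively reweighted least squares."""
    a, b = weighted_fit(x, y, [1.0] * len(x))  # start from the LS fit
    for _ in range(iters):
        r = [yi - a - b * xi for xi, yi in zip(x, y)]
        # Robust residual scale: normalized median absolute deviation.
        med = statistics.median(r)
        s = statistics.median([abs(ri - med) for ri in r]) / 0.6745
        s = max(s, 1e-12)  # guard against a zero scale
        # Huber weights: 1 inside [-c*s, c*s], shrinking outside.
        w = [1.0 if abs(ri) <= c * s else c * s / abs(ri) for ri in r]
        a, b = weighted_fit(x, y, w)
    return a, b


# Simulated data (all settings below are assumptions for illustration).
rng = random.Random(1)
n, maf, true_slope = 200, 0.3, 0.5
x = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
y = [true_slope * xi + rng.gauss(0.0, 1.0) for xi in x]

# Contaminate the trait of the first 20 individuals with genotype 0.
outliers = [i for i, xi in enumerate(x) if xi == 0][:20]
for i in outliers:
    y[i] += 8.0

a_ls, b_ls = weighted_fit(x, y, [1.0] * n)
a_rob, b_rob = huber_fit(x, y)
print(f"true slope: {true_slope}")
print(f"least squares slope: {b_ls:.3f}")
print(f"Huber M-regression slope: {b_rob:.3f}")
```

Under this contamination scheme the least squares slope is pulled away from the true value, while the Huber estimate bounds the influence of the shifted observations and so stays closer to it. The Huber estimator is not fully resistant, so some bias remains.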
