Genomic selection using principal component regression

Many statistical methods are available for genomic selection (GS), in which genetic values of quantitative traits are predicted for plants and animals from whole-genome SNP data. Having far more predictors than subjects is a major computational challenge in GS. Principal components regression (PCR) and its derivative, partial least squares regression (PLSR), address this challenge through dimensionality reduction. In this study, we show that PCR can perform better than PLSR in cross validation. PCR often requires more components than PLSR to reach its maximum predictive ability and may therefore carry a higher computational cost. However, applying the HAT method (a strategy that describes the relationship between the fitted and observed response variables with a hat matrix) to PCR circumvents conventional cross validation when testing predictive ability, which makes PCR substantially more computationally efficient than PLSR, where cross validation is mandatory. The advantages of PCR over PLSR are illustrated with a simulated trait of a hypothetical population and four agronomic traits of a rice population. The benefit of using PCR in genomic selection is further demonstrated by predicting 1000 metabolomic traits and 24,973 transcriptomic traits in the same rice population.
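
To make the hat-matrix idea concrete, the following is a minimal sketch of how leave-one-out prediction errors could be obtained for PCR without refitting the model for each subject. It is an illustration under stated assumptions and not the implementation used in the study: the function name `pcr_hat_press` and its arguments are hypothetical, the principal components are computed once from all subjects and treated as fixed, and `k` is assumed to be smaller than the rank of the centered marker matrix. Under those assumptions the hat matrix of the regression on the first `k` components (plus an intercept) has diagonal `h_ii = 1/n + sum_j U_ij^2`, and each leave-one-out residual equals the ordinary residual divided by `1 - h_ii`.

```python
import numpy as np

def pcr_hat_press(X, y, k):
    """Leave-one-out residuals and PRESS for PCR via the hat matrix.

    Hypothetical helper for illustration. Assumes the principal
    components are computed once from all rows of X and held fixed,
    and that k is smaller than the rank of the centered X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]

    # Center the markers so every left singular vector is orthogonal
    # to the intercept column.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Uk = U[:, :k]  # orthonormal basis of the first k component scores

    # Diagonal of the hat matrix H = 11'/n + Uk Uk'.
    h = 1.0 / n + np.sum(Uk ** 2, axis=1)

    # Ordinary fitted values and residuals of y regressed on [1, scores].
    y_mean = y.mean()
    fitted = y_mean + Uk @ (Uk.T @ (y - y_mean))
    resid = y - fitted

    # Leave-one-out residuals without any refitting.
    loo_resid = resid / (1.0 - h)
    press = float(np.sum(loo_resid ** 2))
    return press, loo_resid
```

Scanning `k` over a range and keeping the value with the smallest PRESS gives one way to choose the number of components with a single decomposition. Because the components are held fixed here, the result can differ from a cross validation that recomputes the principal components within each fold.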
