Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials

Key message

A method based on a multi-task Gaussian process using self-measuring similarity gave increased accuracy for imputing missing phenotypic data in multi-trait and multi-environment trials.

Abstract

Multi-environment trial (MET) data often contain missing values. Accurate imputation of missing data makes subsequent analysis more effective and the results easier to understand. Moreover, accurate imputation may help to reduce the cost of phenotyping for thinned-out lines tested in METs. METs are generally performed for multiple traits that are correlated with each other. Correlation among traits can be useful information for imputation, but single-trait-based methods cannot use the information shared by correlated traits. In this paper, we propose imputation methods based on a multi-task Gaussian process (MTGP) using self-measuring similarity kernels that reflect relationships among traits, genotypes, and environments. This framework allows us to use genetic correlation among multi-trait multi-environment data and also to combine MET data with marker genotype data. We compared the accuracy of three MTGP methods and iterative regularized PCA using rice MET data. Two scenarios for the generation of missing data were considered at various missing rates. The MTGP methods achieved higher imputation accuracy than regularized PCA, especially at high missing rates. Under the ‘uniform’ scenario, in which missing data arise randomly, including marker genotype data in the imputation increased accuracy at high missing rates. Under the ‘fiber’ scenario, in which data are missing in all traits for some combinations of genotypes and environments, including marker genotype data decreased imputation accuracy for most traits while increasing it remarkably for a few traits. The proposed methods will be useful for solving the missing data problem in MET data.
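
The two missing-data scenarios described in the abstract can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the authors' procedure: the array layout (genotypes x environments x traits), the array sizes, the 30% missing rate, and the function names `uniform_mask` and `fiber_mask` are all hypothetical choices made for this example.

```python
import numpy as np

def uniform_mask(shape, rate, rng):
    """'Uniform' scenario: each cell is missing independently at random."""
    return rng.random(shape) < rate

def fiber_mask(shape, rate, rng):
    """'Fiber' scenario: for a random subset of genotype-environment
    combinations, every trait is missing at once."""
    n_geno, n_env, n_trait = shape
    drop = rng.random((n_geno, n_env)) < rate
    # Copy the genotype-environment decision along the trait axis.
    return np.repeat(drop[:, :, None], n_trait, axis=2)

rng = np.random.default_rng(0)
shape = (50, 4, 6)  # hypothetical: 50 genotypes, 4 environments, 6 traits
phenotypes = rng.normal(size=shape)  # simulated stand-in for MET data

m_uniform = uniform_mask(shape, 0.3, rng)
m_fiber = fiber_mask(shape, 0.3, rng)

# Masked cells become NaN; an imputation method would then fill them in
# and be scored against the held-out true values.
observed_uniform = np.where(m_uniform, np.nan, phenotypes)
observed_fiber = np.where(m_fiber, np.nan, phenotypes)
```

In the fiber case no trait is observed for a masked genotype-environment combination, so an imputation method cannot borrow information from other traits measured on the same plot and must rely on other genotypes, other environments, or marker data.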
