Making WAVES in Breedbase: An Integrated Spectral Data Storage and Analysis Pipeline for Plant Breeding Programs

Visible and near-infrared spectroscopy (vis-NIRS) is a promising tool for increasing phenotyping throughput in plant breeding programs, but existing analysis software packages are not optimized for a breeding context. In addition, commercial software is often beyond the budgets of breeding and research programs. To address both gaps, we developed an open-source R package, waves, for the streamlined analysis of spectral data, with several cross-validation schemes for assessing prediction accuracy. The waves package is compatible with a wide range of spectrometer models and performs visualization, filtering, aggregation, cross-validation set formation, model training, and prediction for the association of vis-NIRS spectra with reference measurements. Furthermore, we have integrated the package into the Breedbase family of open-source databases, extending the analysis capabilities of this growing digital ecosystem to a number of crop species. Taken together, the standalone and Breedbase versions of waves make tools for spectral data analysis more accessible during the plant breeding process.

Core ideas
waves is an open-source R package for spectral data analysis in plant breeding.
Breeding-relevant cross-validation schemes evaluate the predictive accuracy of models.
Breedbase, an open-source database, is extended to support spectral data storage.
A graphical user interface was developed for the implementation of waves in Breedbase.
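The aggregation, pretreatment, and cross-validation set formation steps named in the abstract can be illustrated with a minimal sketch. It is written in Python with the standard library only and does not use or reproduce the waves API. Every function name and record field below is hypothetical. Standard normal variate scaling stands in for spectral pretreatment generally, and holding out one whole environment per fold is an assumed example of a breeding-relevant scheme, not necessarily one that waves implements.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_scans(scans):
    """Average replicate scans per sample.

    scans: list of (sample_id, spectrum) pairs, where every spectrum
    is a list of reflectance values on the same wavelength grid.
    """
    grouped = defaultdict(list)
    for sample_id, spectrum in scans:
        grouped[sample_id].append(spectrum)
    # zip(*specs) walks the replicates wavelength by wavelength.
    return {
        sample_id: [mean(column) for column in zip(*specs)]
        for sample_id, specs in grouped.items()
    }


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum
    by its own mean and sample standard deviation."""
    center = mean(spectrum)
    spread = stdev(spectrum)
    if spread == 0:
        raise ValueError("a flat spectrum cannot be scaled")
    return [(value - center) / spread for value in spectrum]


def leave_one_environment_out(records):
    """Yield (environment, train, test), holding out each environment in turn.

    records: list of dicts with a hypothetical 'environment' key.
    """
    environments = sorted({record["environment"] for record in records})
    for environment in environments:
        train = [r for r in records if r["environment"] != environment]
        test = [r for r in records if r["environment"] == environment]
        yield environment, train, test
```

Holding out whole environments, rather than random rows, asks whether a calibration transfers to a trial it has never seen, which is closer to how a breeding program would use the model. A model would then be trained on each `train` set and scored on the matching `test` set.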
