Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers.

In genome-wide association studies using single nucleotide polymorphisms (SNPs), typically thousands of SNPs are genotyped, whereas the number of phenotypes for which there is genomic information may be smaller. A two-step SNP (feature) selection method was developed, consisting of filtering (using information gain) followed by wrapping (using naive Bayesian classification). The method was based on discretization of the continuous phenotypic values. It was applied to chick early mortality rates (0-14 days of age) of progeny from 201 sires in a commercial broiler line, with the goal of identifying SNPs (among more than 5000 genotyped) related to progeny mortality. Sires were clustered into two groups, low and high, according to two arbitrarily chosen mortality rate thresholds. By varying these thresholds, 11 different "case-control" samples were formed, and the SNP selection procedure was applied to each sample. To compare the 11 sets of chosen SNPs, the predicted residual sum of squares (PRESS) from a linear model was used. Feature selection improved naive Bayesian classification accuracy from 50% to 90%. Seventeen SNPs in the best case-control group (the one with the smallest PRESS) accounted for 31% of the variance among sire family mortality rates.
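The two-step idea (filter by information gain, then wrap with a naive Bayesian classifier) can be illustrated with a small sketch. This is not the authors' implementation: the genotype coding (0/1/2), the Laplace smoothing, the leave-one-out accuracy criterion, the greedy forward search, the function names, and the toy data are all assumptions made for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(genotypes, labels):
    """Class entropy minus conditional entropy given one SNP's genotypes."""
    n = len(labels)
    cond = 0.0
    for g in set(genotypes):
        subset = [y for x, y in zip(genotypes, labels) if x == g]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def nb_predict(train_X, train_y, row, snps):
    """Naive Bayes over the given SNP columns (assumed: 3 genotype codes,
    Laplace smoothing)."""
    best, best_score = None, -math.inf
    for c in sorted(set(train_y)):
        rows = [x for x, y in zip(train_X, train_y) if y == c]
        score = math.log(len(rows) / len(train_y))
        for j in snps:
            count = sum(1 for x in rows if x[j] == row[j])
            score += math.log((count + 1) / (len(rows) + 3))
        if score > best_score:
            best, best_score = c, score
    return best

def loo_accuracy(X, y, snps):
    """Leave-one-out classification accuracy for a SNP subset."""
    hits = 0
    for i in range(len(y)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += nb_predict(train_X, train_y, X[i], snps) == y[i]
    return hits / len(y)

def select_snps(X, y, keep=3):
    """Step 1: keep the `keep` SNPs with highest information gain.
    Step 2: greedy forward selection, adding the SNP that most improves accuracy."""
    n_snps = len(X[0])
    gains = [information_gain([row[j] for row in X], y) for j in range(n_snps)]
    candidates = sorted(range(n_snps), key=lambda j: gains[j], reverse=True)[:keep]
    chosen, best_acc = [], 0.0
    improved = True
    while improved:
        improved = False
        for j in candidates:
            if j in chosen:
                continue
            acc = loo_accuracy(X, y, chosen + [j])
            if acc > best_acc:
                best_acc, best_j, improved = acc, j, True
        if improved:
            chosen.append(best_j)
    return chosen, best_acc

# Toy data (hypothetical): 8 sires x 4 SNPs; SNP 0 separates the two groups.
X = [[0, 1, 2, 0], [0, 0, 1, 1], [0, 2, 0, 2], [0, 1, 1, 0],
     [2, 1, 2, 1], [2, 0, 1, 0], [2, 2, 0, 2], [2, 1, 1, 1]]
y = ["low"] * 4 + ["high"] * 4
chosen, acc = select_snps(X, y)
print(chosen, acc)
```

In this toy example the "low" and "high" labels stand in for the discretized sire mortality groups; with real data, the thresholds that define the groups would be varied to form the different case-control samples described above.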
