Disease Liability Prediction from Large Scale Genotyping Data Using Classifiers with a Reject Option

Genome-wide association studies (GWAS) aim to identify the genetic polymorphisms associated with variation in phenotypes. However, even the most significant genetic variants may have little power to predict the future development of common diseases. We study the prediction of disease risk from genome-wide genotype data using classifiers with a reject option, which make a prediction only when they are sufficiently certain and otherwise abstain from classifying. To test the reliability of our proposal, we used the Wellcome Trust Case Control Consortium (WTCCC) data set, comprising 14,000 cases of seven common human diseases and 3,000 shared controls.
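
The idea of a reject option can be illustrated with a minimal sketch. It assumes a binary case/control classifier that outputs an estimated probability of being a case, and it applies a simple confidence threshold in the spirit of Chow's reject rule. The function names, the threshold value, and the toy probabilities are hypothetical and chosen only for illustration; this is not necessarily the procedure used in the study.

```python
from typing import List, Optional, Tuple


def predict_with_reject(p_case: float, threshold: float = 0.75) -> Optional[int]:
    """Return 1 (case), 0 (control), or None (reject).

    p_case is an estimated probability of being a case (assumed to come
    from some probabilistic classifier). The prediction is rejected when
    the larger of the two class probabilities is below `threshold`.
    """
    if not 0.0 <= p_case <= 1.0:
        raise ValueError("p_case must lie in [0, 1]")
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0.5, 1]")
    confidence = max(p_case, 1.0 - p_case)
    if confidence < threshold:
        return None  # doubtful situation: abstain
    return 1 if p_case >= 0.5 else 0


def coverage_and_accuracy(
    probs: List[float], labels: List[int], threshold: float
) -> Tuple[float, Optional[float]]:
    """Fraction of individuals classified, and accuracy on those classified."""
    decisions = [predict_with_reject(p, threshold) for p in probs]
    accepted = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(accepted) / len(labels)
    if not accepted:
        return coverage, None  # accuracy is undefined when everything is rejected
    accuracy = sum(1 for d, y in accepted if d == y) / len(accepted)
    return coverage, accuracy


# Toy, made-up probabilities and labels (not WTCCC data).
toy_probs = [0.9, 0.6, 0.2, 0.45]
toy_labels = [1, 0, 0, 1]
strict = coverage_and_accuracy(toy_probs, toy_labels, threshold=0.75)
lenient = coverage_and_accuracy(toy_probs, toy_labels, threshold=0.5)
```

Raising the threshold trades coverage for reliability: fewer individuals receive a prediction, but the predictions that are made tend to be more accurate. With a threshold of 0.5 nothing is rejected and the rule reduces to an ordinary classifier.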
