Breast cancer prediction using genome wide single nucleotide polymorphism data

Background: This paper introduces and applies a genome-wide predictive study to learn a model that predicts whether a new subject will develop breast cancer, based on her SNP profile.

Results: We first genotyped 696 female subjects (348 breast cancer cases and 348 apparently healthy controls), predominantly of Caucasian origin from Alberta, Canada, using Affymetrix Human SNP 6.0 arrays. We then applied the EIGENSTRAT population stratification correction method to remove 73 subjects not belonging to the Caucasian population. Next, we filtered out any SNP that had any missing calls, whose genotype frequencies deviated from Hardy-Weinberg equilibrium, or whose minor allele frequency was less than 5%. Finally, we applied a combination of the MeanDiff feature selection method and the k-nearest neighbours (KNN) learning method to this filtered dataset to produce a breast cancer prediction model. The leave-one-out cross-validation (LOOCV) accuracy of this classifier is 59.55%. Random permutation tests show that this result is significantly better than the baseline accuracy of 51.52%. Sensitivity analysis shows that the classifier is fairly robust to the number of MeanDiff-selected SNPs. External validation on the CGEMS breast cancer dataset, the only other publicly available breast cancer dataset, shows that this combination of MeanDiff and KNN leads to a LOOCV accuracy of 60.25%, which is significantly better than its baseline of 50.06%. We then considered a dozen different combinations of feature selection and learning methods, but found that none of them produces a better predictive model than ours.

We also built predictive models using biologically motivated feature selection: SNPs reported in recent genome-wide association studies as associated with breast cancer, SNPs in genes associated with KEGG cancer pathways, or SNPs associated with breast cancer in the F-SNP database. Again, none of these models achieved accuracy better than baseline.

Conclusions: We anticipate producing more accurate breast cancer prediction models by recruiting more study subjects, providing more accurate labelling of phenotypes (to accommodate the heterogeneity of breast cancer), measuring other genomic alterations such as point mutations and copy number variations, and incorporating non-genetic information about subjects such as environmental and lifestyle factors.
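The evaluation described in the Results can be illustrated with a small sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: that MeanDiff ranks SNPs by the absolute difference between the mean genotype of cases and of controls, that genotypes are coded 0/1/2, that KNN uses Euclidean distance with a majority vote, that feature selection is repeated inside every LOOCV fold, and that the baseline is the majority-class frequency. The function names and the values of `m` and `k` are hypothetical, and the data in the demo are random.

```python
import random


def mean_diff_select(X, y, m):
    """Indices of the m SNPs with the largest |mean(cases) - mean(controls)|.

    Assumption: this is what "MeanDiff" means; genotypes are coded 0/1/2.
    """
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        case_mean = sum(row[j] for row in cases) / len(cases)
        control_mean = sum(row[j] for row in controls) / len(controls)
        scores.append(abs(case_mean - control_mean))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:m]


def knn_predict(train_X, train_y, query, k):
    """Majority vote of the k nearest subjects (squared Euclidean distance).

    Ties, which are possible when k is even, are resolved as control (0).
    """
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_X[i], query))

    nearest = sorted(range(len(train_X)), key=dist)[:k]
    case_votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * case_votes > k else 0


def loocv_accuracy(X, y, m, k):
    """Leave-one-out accuracy, reselecting SNPs on each training fold."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        # Selecting features without the held-out subject avoids label leakage.
        keep = mean_diff_select(train_X, train_y, m)
        reduced = [[row[j] for j in keep] for row in train_X]
        query = [X[i][j] for j in keep]
        correct += knn_predict(reduced, train_y, query, k) == y[i]
    return correct / len(X)


def majority_baseline(y):
    """Accuracy of always predicting the more frequent class."""
    n_cases = sum(y)
    return max(n_cases, len(y) - n_cases) / len(y)


def permutation_p_value(X, y, m, k, n_perm=20, seed=0):
    """Share of label shufflings whose LOOCV accuracy reaches the observed one."""
    rng = random.Random(seed)
    observed = loocv_accuracy(X, y, m, k)
    hits = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)
        if loocv_accuracy(X, shuffled, m, k) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy data only: 40 subjects, 30 random SNPs with no real signal.
    rng = random.Random(1)
    y = [1] * 20 + [0] * 20
    X = [[rng.choice([0, 1, 2]) for _ in range(30)] for _ in y]
    print("LOOCV accuracy:", loocv_accuracy(X, y, m=5, k=3))
    print("majority baseline:", majority_baseline(y))
    print("permutation p-value:", permutation_p_value(X, y, m=5, k=3))
```

Reselecting SNPs within each fold matters here: ranking SNPs on all subjects before cross-validation would let the held-out label influence the features and inflate the accuracy estimate.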
