Selection of SNP Subsets for Severity of Beta-thalassaemia Classification Problem

Single-nucleotide polymorphisms (SNPs) are genetic variables that are widely used in genome-wide association studies (GWAS) and in studies of genetic disorders. A distinctive trait of SNPs is their sheer number, since they arise from many positions in a DNA sequence. The number of samples investigated is usually far smaller than the number of SNPs, so over-fitting often occurs when constructing a predictive model that classifies a sample as a case or a control. This study investigated a dataset on beta-thalassemia, a genetic disorder common in the Thai population. The samples in the dataset are divided into two groups: severe and mild. The aims of the study were to develop and evaluate methods for screening and ranking SNPs related to this disorder. The screening methods tested were the Chi-squared test (χ²), Information Gain, and Gradient Boosting (GB). The SNPs retained by screening were then used to construct a predictive model that classifies a sample as either a severe or a mild case. The model construction methods tested were Support Vector Machine (SVM), GB, and Naïve Bayes. Several combinations of a screening method and a model construction method were evaluated, and the results show that the best combination was χ²-SVM with 10 selected SNPs.
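
As a rough illustration of the chi-squared screening step only, the following minimal sketch ranks SNPs by the Pearson chi-squared statistic of their genotype-by-class contingency table and keeps the top k. It assumes genotypes coded as 0/1/2 and uses a tiny made-up dataset; the function names (`chi2_statistic`, `top_k_snps`) and the SNP names are hypothetical, and it does not reproduce the study's pipeline or the subsequent model construction step.

```python
from collections import Counter


def chi2_statistic(genotypes, labels):
    """Pearson chi-squared statistic for a genotype-by-class contingency table."""
    n = len(genotypes)
    observed = Counter(zip(genotypes, labels))  # cell counts
    row_totals = Counter(genotypes)             # counts per genotype
    col_totals = Counter(labels)                # counts per class
    stat = 0.0
    for g, row_n in row_totals.items():
        for c, col_n in col_totals.items():
            expected = row_n * col_n / n
            stat += (observed[(g, c)] - expected) ** 2 / expected
    return stat


def top_k_snps(snp_columns, labels, k):
    """Score every SNP column and return the k highest-scoring names."""
    scores = {name: chi2_statistic(col, labels)
              for name, col in snp_columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data (invented for illustration): 4 severe and 4 mild samples.
labels = ["severe"] * 4 + ["mild"] * 4
snps = {
    "snp_a": [2, 2, 2, 2, 0, 0, 0, 0],  # separates the classes perfectly
    "snp_b": [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to the class
    "snp_c": [2, 2, 2, 1, 0, 0, 0, 1],  # separates the classes partially
}

selected, scores = top_k_snps(snps, labels, k=2)
print(selected)
```

In a setting like the one described, the selected subset would then be passed to a classifier; to limit over-fitting, screening would normally be performed on training data only, inside each cross-validation fold.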
