A comparative study of feature ranking methods as dimension reduction technique in Genome-Wide Association Study

In recent years, genome-wide association studies (GWAS) have been performed by many scientists around the world to find associations between the genetic profiles of individuals and the risk of developing certain diseases. GWAS are performed using single nucleotide polymorphism (SNP) data, which represent the genotypes of two groups of individuals: the case group, who have the disease, and the control group, who do not. The very high dimensionality of SNP data poses challenges in analyzing GWAS results. This issue can be tackled by performing feature ranking to remove irrelevant features and thereby reduce the dimension of the original data. This work compares several feature ranking methods, including the chi-square statistic, information gain, recursive feature elimination, and the Relief algorithm, by analyzing the performance of different learning machines combined with each feature ranking method. The highest performance is obtained by combining recursive feature elimination with a linear SVM, while the worst performance is shown by the Relief algorithm. The experiments show that the classifiers generally benefit from feature selection, but that the highest-ranked features do not necessarily yield the best classifier.
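As an illustration of the kind of feature ranking described above, the following is a minimal sketch of chi-square ranking on SNP data, written in plain Python. It is not the pipeline used in the study: the genotype coding (0/1/2 by minor-allele count), the toy matrix `X`, the labels `y`, and the helper names `chi_square_score` and `rank_features` are all assumptions made for this example.

```python
from collections import Counter


def chi_square_score(feature, labels):
    """Pearson chi-square statistic for one categorical feature vs. class labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))   # joint counts (genotype, class)
    feature_totals = Counter(feature)          # marginal counts per genotype
    label_totals = Counter(labels)             # marginal counts per class
    score = 0.0
    for g, g_count in feature_totals.items():
        for c, c_count in label_totals.items():
            expected = g_count * c_count / n   # count expected under independence
            score += (observed[(g, c)] - expected) ** 2 / expected
    return score


def rank_features(X, y):
    """Return (feature indices sorted by descending score, list of scores)."""
    n_features = len(X[0])
    scores = [chi_square_score([row[j] for row in X], y) for j in range(n_features)]
    ranking = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranking, scores


# Toy data invented for illustration: rows are individuals, columns are SNPs
# coded 0/1/2 (assumed coding); y is 1 for case and 0 for control.
X = [
    [0, 2, 1],
    [0, 2, 0],
    [1, 1, 1],
    [0, 1, 2],
    [2, 0, 1],
    [2, 0, 0],
    [1, 1, 1],
    [2, 1, 2],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

ranking, scores = rank_features(X, y)

# Dimension reduction: keep only the top-k ranked SNPs before training a classifier.
top_k = ranking[:2]
X_reduced = [[row[j] for j in top_k] for row in X]
```

Here the reduced matrix `X_reduced` would be passed to a classifier in place of `X`. Choosing the cut-off `k` is itself a tuning decision, and a filter score of this kind evaluates each SNP in isolation, unlike a wrapper method such as recursive feature elimination.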
