A Hybrid Feature Selection Method for Complex Diseases SNPs

Machine learning techniques have the potential to revolutionize medical diagnosis. Single nucleotide polymorphisms (SNPs) are one of the most important sources of human genome variability and have been implicated in several human diseases. Various techniques have been applied to SNP data to separate affected samples from normal ones. Achieving high classification accuracy in such a high-dimensional space is crucial for successful diagnosis and treatment. In this work, we propose an accurate hybrid feature selection method for detecting the most informative SNPs and selecting an optimal SNP subset. The proposed method fuses a filter method and a wrapper method, namely conditional mutual information maximization (CMIM) and support vector machine-recursive feature elimination (SVM-RFE), respectively. The performance of the proposed method was evaluated against four state-of-the-art feature selection methods (minimum redundancy maximum relevance, fast correlation-based feature selection, CMIM, and ReliefF) using four classifiers (support vector machine, naive Bayes, linear discriminant analysis, and $k$-nearest neighbors) on five different SNP data sets obtained from the National Center for Biotechnology Information Gene Expression Omnibus genomics data repository. The experimental results demonstrate the efficiency of the proposed feature selection approach, which outperformed all of the compared feature selection algorithms and achieved up to 96% classification accuracy on the data used. From these results we conclude that whole-genome SNPs can be efficiently employed to distinguish individuals affected by complex diseases from healthy ones.
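
To make the filter-plus-wrapper idea concrete, the following is a minimal, dependency-free Python sketch rather than the authors' implementation. It assumes that the two stages run in sequence (CMIM first shortlists SNPs, then SVM-RFE prunes the shortlist), that genotypes are coded as small integers, and that labels are 0/1; the abstract does not specify these details. All function names (`cond_mutual_info`, `cmim`, `train_linear_svm`, `svm_rfe`, `hybrid_select`) and the toy data are hypothetical, and the linear SVM is a simple subgradient-descent stand-in for a proper solver.

```python
from collections import Counter
from math import log


def cond_mutual_info(x, y, z):
    """Empirical I(X; Y | Z) in nats for equal-length discrete sequences."""
    n = len(x)
    c_xyz, c_xz = Counter(zip(x, y, z)), Counter(zip(x, z))
    c_yz, c_z = Counter(zip(y, z)), Counter(z)
    return sum(
        c / n * log(c * c_z[zv] / (c_xz[xv, zv] * c_yz[yv, zv]))
        for (xv, yv, zv), c in c_xyz.items()
    )


def cmim(columns, y, k):
    """Filter stage: greedy CMIM ranking, returns k column indices."""
    const = [0] * len(y)  # conditioning on a constant gives plain I(X; Y)
    score = [cond_mutual_info(col, y, const) for col in columns]
    selected = []
    for _ in range(min(k, len(columns))):
        rest = [i for i in range(len(columns)) if i not in selected]
        best = max(rest, key=lambda i: score[i])
        selected.append(best)
        for i in rest:
            if i != best:
                # Keep the smallest information about y left after
                # conditioning on any already-selected SNP.
                score[i] = min(score[i],
                               cond_mutual_info(columns[i], y, columns[best]))
    return selected


def train_linear_svm(rows, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wi for wi in w], 0.0
        for r, label in zip(rows, y):
            s = 1 if label == 1 else -1
            if s * (sum(wi * ri for wi, ri in zip(w, r)) + b) < 1:
                for j in range(d):
                    gw[j] -= s * r[j] / n
                gb -= s / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b


def svm_rfe(rows, y, features, n_keep):
    """Wrapper stage: repeatedly drop the SNP with the smallest squared weight."""
    features = list(features)
    while len(features) > n_keep:
        sub = [[r[f] for f in features] for r in rows]
        w, _ = train_linear_svm(sub, y)
        features.pop(min(range(len(features)), key=lambda j: w[j] ** 2))
    return features


def hybrid_select(rows, y, n_filter, n_keep):
    """Assumed pipeline: CMIM shortlist of n_filter SNPs, then SVM-RFE to n_keep."""
    columns = list(zip(*rows))
    return svm_rfe(rows, y, cmim(columns, y, n_filter), n_keep)


# Toy data (made up): 8 samples x 4 SNPs; SNP 0 tracks the label exactly.
toy_rows = [[0, 1, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0], [0, 1, 1, 1],
            [2, 1, 0, 1], [2, 1, 1, 2], [2, 1, 0, 2], [2, 1, 1, 2]]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(hybrid_select(toy_rows, toy_labels, n_filter=2, n_keep=1))
```

Running the cheap filter first keeps the number of SVM fits small, which matters when the starting point is a genome-wide SNP panel; the wrapper then judges the shortlisted SNPs jointly rather than one at a time.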
