An Efficient Classification for Single Nucleotide Polymorphism (SNP) Dataset

Single nucleotide polymorphisms (SNPs), the basic units of genetic variation, have recently attracted much attention because of their association with complex diseases. Various machine learning techniques have been applied to SNP data to distinguish individuals affected by a disease from healthy ones or to predict their predisposition. However, the data format and enormous feature space make SNP analysis a complicated task. This research proposes an efficient method to facilitate SNP data classification. The aim was to find the most effective way of analyzing SNP data by combining various existing techniques. The experiment was conducted on four SNP datasets obtained from the NCBI Gene Expression Omnibus (GEO) website: two are from patients with mental disorders and their healthy parents, and the other two are cancer-related. The analysis process consists of three stages: first, reduction of the feature space and selection of informative SNPs; next, generation of an artificial feature from the selected SNPs; and finally, classification and validation. The proposed approach proved effective, distinguishing the two groups of individuals with high accuracy, in some cases reaching 100%.
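The three-stage process described above can be illustrated with a minimal sketch. The abstract does not name the specific algorithms, so everything below is an assumption chosen for illustration: genotypes are encoded as 0/1/2 minor-allele counts, SNPs are ranked by the absolute difference of class means, the artificial feature is a signed sum of the selected SNPs, and classification uses k-nearest neighbours with leave-one-out validation. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter

def rank_snps(X, y):
    """Stage 1 (assumed scoring): rank SNPs by |mean(case) - mean(control)|."""
    scores = []
    for j in range(len(X[0])):
        case = [row[j] for row, label in zip(X, y) if label == 1]
        ctrl = [row[j] for row, label in zip(X, y) if label == 0]
        diff = sum(case) / len(case) - sum(ctrl) / len(ctrl)
        # Keep the direction of the effect for use in stage 2.
        scores.append((abs(diff), j, 1 if diff >= 0 else -1))
    scores.sort(key=lambda t: (-t[0], t[1]))  # best score first, ties by index
    return scores

def add_artificial_feature(X, selected):
    """Stage 2 (assumed form): keep selected SNPs and append their signed sum."""
    out = []
    for row in X:
        kept = [row[j] for _, j, _ in selected]
        derived = sum(sign * row[j] for _, j, sign in selected)
        out.append(kept + [derived])
    return out

def knn_predict(train_X, train_y, x, k=3):
    """Stage 3a: majority vote among the k nearest training samples."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(X, y, top_k=2, k=3):
    """Stage 3b: leave-one-out validation; selection is redone in every fold."""
    correct = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        selected = rank_snps(train_X, train_y)[:top_k]
        train_Z = add_artificial_feature(train_X, selected)
        test_z = add_artificial_feature([X[i]], selected)[0]
        if knn_predict(train_Z, train_y, test_z, k) == y[i]:
            correct += 1
    return correct / len(X)

# Hypothetical toy data: 8 individuals x 4 SNPs, label 1 = affected, 0 = healthy.
X = [
    [2, 0, 2, 1], [2, 1, 1, 0], [1, 2, 2, 1], [2, 0, 2, 2],
    [0, 1, 0, 1], [0, 2, 1, 0], [1, 0, 0, 2], [0, 1, 0, 1],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    print("selected SNP indices:", [j for _, j, _ in rank_snps(X, y)[:2]])
    print("leave-one-out accuracy:", loo_accuracy(X, y))
```

In this sketch the SNP selection is repeated inside each validation fold, so the held-out individual never influences which SNPs are chosen; selecting on the full dataset first would tend to inflate the reported accuracy.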
