A Self-organizing Deep Auto-Encoder approach for Classification of Complex Diseases using SNP Genomics Data

Abstract: Many machine learning algorithms have recently been used to identify significant Single Nucleotide Polymorphisms (SNPs) in various human diseases. However, several obstacles complicate SNP detection and the classification of healthy versus patient samples. The main one is the curse of dimensionality: the number of samples is far smaller than the number of SNPs. In addition, the numbers of healthy and patient samples can be unequal. Together, these issues make feature selection and classification very difficult. The goal of this study is to combine several algorithms to find the most effective way of analyzing SNP data. To that end, an efficient method is proposed to identify significant SNPs and classify healthy and patient samples. First, Mean Encoding is used to convert the nominal SNP data to numeric values. Then, a two-step filter method performs feature selection by removing irrelevant and redundant features. Finally, the proposed deep auto-encoder, which constructs its structure automatically based on the input data, is used for classification. For evaluation, the proposed approach is applied to five SNP datasets, covering thyroid cancer, mental retardation, breast cancer, colorectal cancer, and autism, which were obtained from the Gene Expression Omnibus (GEO) database. Based on the selected features, the method classifies healthy and patient samples with accuracies of 100% (thyroid cancer), 94.4% (mental retardation), 100% (breast cancer), 96% (colorectal cancer), and 99.1% (autism). These results compare favorably with other published works.
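The Mean Encoding step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so this assumes the common definition: each genotype category of a SNP is replaced by the mean of the binary class label (0 = healthy, 1 = patient) over the samples carrying that genotype. The function name `mean_encode_column` and the toy genotypes are hypothetical, and no smoothing or train/test separation is applied.

```python
from collections import defaultdict


def mean_encode_column(genotypes, labels):
    """Replace each genotype by the mean label of the samples that carry it.

    Assumption: labels are binary (0 = healthy, 1 = patient), so the encoded
    value is the fraction of patients observed for that genotype.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for genotype, label in zip(genotypes, labels):
        sums[genotype] += label
        counts[genotype] += 1
    # Per-category mean of the target.
    mapping = {g: sums[g] / counts[g] for g in counts}
    encoded = [mapping[g] for g in genotypes]
    return encoded, mapping


# Toy data for a single SNP column (illustrative values only).
genotypes = ["AA", "AB", "BB", "AB", "AA", "BB"]
labels = [0, 1, 1, 0, 0, 1]

encoded, mapping = mean_encode_column(genotypes, labels)
print(mapping)
print(encoded)
```

In practice the mapping would be estimated on training samples only and then applied to held-out samples, since computing it on all samples leaks label information into the features.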
