Analysis of high-dimensional genomic data using MapReduce based probabilistic neural network

BACKGROUND The volume of genomic data has grown rapidly over the last decade, and conventional data analysis techniques cannot process data at this scale. Processing high-dimensional datasets efficiently requires new parallel methods.

METHODS This work presents a novel distributed method based on the MapReduce (MR) approach. The proposed algorithm consists of an MR-based Fisher score (mrFScore), an MR-based ReliefF (mrReliefF), and an MR-based probabilistic neural network (mrPNN) tuned with a weighted chaotic grey wolf optimization technique (WCGWO). mrFScore and mrReliefF are introduced for feature selection (FS), and mrPNN is implemented for microarray classification. The choice of the smoothing parameter (σ) strongly affects the prediction ability of the PNN; the novel WCGWO algorithm is used to select its optimal value.

RESULTS The algorithms have been implemented using the Hadoop framework. The proposed model is tested on three large microarray datasets and one small one, and is compared with existing FS and classification techniques. The results suggest that WCGWO-mrPNN can outperform other methods for high-dimensional microarray classification.

CONCLUSION The effectiveness of the proposed methods is compared with that of other existing schemes. Experimental results indicate that the proposed scheme is accurate and robust. Hence, the suggested scheme is considered a reliable framework for microarray data analysis.

SIGNIFICANCE Such a method promotes the application of parallel programming on a Hadoop cluster to the analysis of large-scale genomics data, particularly when the dataset is of high dimension.
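The two core ideas named in the abstract, a Fisher score computed from per-chunk partial statistics and a PNN whose behaviour depends on σ, can be illustrated with a minimal single-machine sketch. This is not the authors' mrFScore or mrPNN implementation: it uses NumPy instead of Hadoop, the in-memory chunks merely stand in for mapper input splits, σ is fixed by hand rather than tuned by WCGWO, and all function names (`map_stats`, `reduce_stats`, `fisher_scores`, `pnn_predict`, `make_data`) and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from functools import reduce

def map_stats(chunk_X, chunk_y, classes):
    """Map step (hypothetical): per-class count, sum, and sum of squares for one chunk."""
    stats = {}
    for c in classes:
        rows = chunk_X[chunk_y == c]
        stats[c] = (rows.shape[0], rows.sum(axis=0), (rows ** 2).sum(axis=0))
    return stats

def reduce_stats(a, b):
    """Reduce step (hypothetical): add the partial statistics of two chunks class by class."""
    return {c: tuple(x + y for x, y in zip(a[c], b[c])) for c in a}

def fisher_scores(stats):
    """Fisher score per feature: between-class scatter over within-class scatter.

    Assumes every class appears at least once in the combined statistics.
    """
    n_total = sum(n for n, _, _ in stats.values())
    overall_mean = sum(s for _, s, _ in stats.values()) / n_total
    between, within = 0.0, 0.0
    for n, s, sq in stats.values():
        mean = s / n
        var = sq / n - mean ** 2              # population variance of this class
        between = between + n * (mean - overall_mean) ** 2
        within = within + n * var
    return between / (within + 1e-12)         # small constant avoids division by zero

def pnn_predict(X_train, y_train, X_test, sigma):
    """PNN with a Gaussian kernel: pick the class with the largest mean kernel activation."""
    classes = np.unique(y_train)
    # Squared Euclidean distances between test and training samples, shape (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-d2 / (2.0 * sigma ** 2))
    # Summation layer: average the pattern-layer outputs belonging to each class.
    scores = np.stack([kernel[:, y_train == c].mean(axis=1) for c in classes], axis=1)
    return classes[scores.argmax(axis=1)]

def make_data(rng, n_per_class, n_features=20):
    """Synthetic two-class data in which only features 0 and 1 are informative."""
    X = rng.normal(size=(2 * n_per_class, n_features))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, :2] += 3.0
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_data(rng, n_per_class=100)
X_test, y_test = make_data(rng, n_per_class=50)

# Four in-memory chunks stand in for the input splits of four mappers.
chunks = np.array_split(np.arange(len(y_train)), 4)
partials = [map_stats(X_train[i], y_train[i], classes=(0, 1)) for i in chunks]
scores = fisher_scores(reduce(reduce_stats, partials))
top = np.argsort(scores)[::-1][:2]            # keep the two best-scoring features

pred = pnn_predict(X_train[:, top], y_train, X_test[:, top], sigma=0.5)
print("selected features:", sorted(top.tolist()))
print("test accuracy:", (pred == y_test).mean())
```

Because counts, sums, and sums of squares are additive, the reduced statistics match those computed over the full dataset in one pass, which is what makes the Fisher score a natural fit for a map and reduce decomposition. Rerunning `pnn_predict` with very small or very large `sigma` values is a simple way to see why the abstract treats σ as a parameter worth optimizing.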
