A hybrid multi-objective whale optimization algorithm for analyzing microarray data based on Apache Spark

Microarrays generate vast volumes of data describing the expression profiles of the genes under investigation, volumes large enough to qualify as Big Data. Hadoop and Spark are efficient frameworks developed to store and analyze Big Data. Analyzing microarray data helps researchers identify correlated genes. Clustering has been successfully applied to microarray data by grouping genes with similar expression profiles into clusters. The complex nature of microarray data obliges clustering methods to employ multiple evaluation functions to obtain high-quality solutions, which turns the clustering problem into a Multi-Objective Problem (MOP). A new and efficient hybrid Multi-Objective Whale Optimization Algorithm with Tabu Search (MOWOATS) has been proposed to solve MOPs. In this article, MOWOATS is applied to the analysis of massive microarray datasets. Three evaluation functions were developed to ensure an effective assessment of solutions, and MOWOATS was adapted to run in parallel using Spark over Hadoop computing clusters. The quality of the generated solutions was evaluated using several indices, including the Silhouette and Davies–Bouldin indices. The obtained clusters were very similar to the original classes. Regarding scalability, the running time was inversely proportional to the number of computing nodes.
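To make the two named validity indices concrete, the following is a minimal single-machine sketch of the Silhouette and Davies–Bouldin computations in plain Python. It is not the article's Spark implementation: the function names, the toy data, and the choice of Euclidean distance are assumptions made for illustration only, and the sketch expects at least two clusters with distinct centroids.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def group_by_label(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette_index(points, labels):
    """Mean silhouette width; ranges over [-1, 1], higher is better."""
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from the point to itself contributes 0 to the sum).
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin_index(points, labels):
    """Davies-Bouldin index; non-negative, lower is better."""
    clusters = group_by_label(points, labels)
    centroids = {
        key: [sum(dim) / len(members) for dim in zip(*members)]
        for key, members in clusters.items()
    }
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = {
        key: sum(euclidean(p, centroids[key]) for p in members) / len(members)
        for key, members in clusters.items()
    }
    keys = list(clusters)
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in keys
            if j != i
        )
        for i in keys
    ]
    return sum(worst_ratios) / len(worst_ratios)


# Toy "expression profiles" (hypothetical data): two well-separated groups.
profiles = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]

print("silhouette (good):", silhouette_index(profiles, good_labels))
print("davies-bouldin (good):", davies_bouldin_index(profiles, good_labels))
```

A partition that matches the natural grouping should score a Silhouette value close to 1 and a Davies–Bouldin value close to 0, whereas a partition that mixes the groups, such as `bad_labels`, should score worse on both.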
