Cloud computing-based parallel genetic algorithm for gene selection in cancer classification

Cancer classification is a key step in patient treatment, which compels modern clinical researchers to adopt advanced bioinformatics methods for it. Classification is usually performed on gene expression data obtained from microarray experiments, using advanced machine learning methods. A microarray experiment generates a huge amount of data, and processing it with machine learning methods is a major challenge. This study uses a two-step classification paradigm that combines genetic algorithm feature selection with machine learning classifiers. The genetic algorithm is built in the MapReduce programming style, which makes it highly scalable on a Hadoop cluster. To improve its performance, the algorithm is extended into a parallel algorithm that processes microarray data in a distributed manner using the Hadoop MapReduce framework. The algorithm was tested on eleven GEMS data sets (9 tumors, 11 tumors, 14 tumors, brain tumor 1, lung cancer, brain tumor 2, leukemia 1, DLBCL, leukemia 2, SRBCT, and prostate tumor), and its accuracy reached 100% with fewer than 25 selected features. The proposed cloud computing-based MapReduce parallel genetic algorithm performed well on gene expression data. In principle, its scalability is limited only by the underlying Hadoop MapReduce platform. The presented results indicate that the proposed method can be effectively applied to real-world microarray data in a cloud environment. In addition, the Hadoop MapReduce framework yields a substantial decrease in computation time.
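The two-step idea described above can be illustrated with a minimal, single-process Python sketch. It is not the authors' implementation: the nearest-centroid classifier, the leave-one-out fitness, the feature-count penalty, and every function name and parameter value are assumptions chosen for illustration. The map step scores each chromosome (a bit mask over genes) independently, which is the part a Hadoop mapper could run in parallel; the reduce step gathers the scores and breeds the next generation.

```python
import random

def make_toy_data(seed=0, n_genes=6):
    """Hypothetical toy data: gene 0 separates the two classes, the rest is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(4):
            X.append([label * 5 + rng.random() * 0.5] +
                     [rng.random() for _ in range(n_genes - 1)])
            y.append(label)
    return X, y

def nearest_centroid_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier on the selected genes."""
    genes = [j for j, bit in enumerate(mask) if bit]
    if not genes:
        return 0.0
    correct = 0
    for i in range(len(X)):
        # Build class centroids from every sample except the held-out one.
        sums, counts = {}, {}
        for k in range(len(X)):
            if k == i:
                continue
            acc = sums.setdefault(y[k], [0.0] * len(genes))
            for p, j in enumerate(genes):
                acc[p] += X[k][j]
            counts[y[k]] = counts.get(y[k], 0) + 1
        best_label, best_dist = None, float("inf")
        for label, acc in sums.items():
            dist = sum((X[i][j] - acc[p] / counts[label]) ** 2
                       for p, j in enumerate(genes))
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += best_label == y[i]
    return correct / len(X)

def map_fitness(chromosome, X, y, penalty=0.01):
    """Map step: score one chromosome independently of all others."""
    score = nearest_centroid_accuracy(X, y, chromosome) - penalty * sum(chromosome)
    return score, chromosome

def reduce_next_generation(scored, rng, mutation_rate=0.05):
    """Reduce step: keep the best half, then refill by crossover and mutation."""
    ranked = [c for _, c in sorted(scored, key=lambda sc: sc[0], reverse=True)]
    parents = ranked[: max(2, len(ranked) // 2)]
    children = list(parents)  # elitism: the best chromosomes survive unchanged
    while len(children) < len(ranked):
        a, b = rng.sample(parents, 2)
        cut = rng.randrange(1, len(a))  # one-point crossover
        child = [bit ^ (rng.random() < mutation_rate) for bit in a[:cut] + b[cut:]]
        children.append(child)
    return children

def select_genes(X, y, pop_size=20, generations=15, seed=0):
    """Run the map/reduce loop in-process and return (best_score, best_mask)."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # On a cluster, each call below would be an independent mapper task.
        scored = [map_fitness(c, X, y) for c in population]
        population = reduce_next_generation(scored, rng)
    return max(map_fitness(c, X, y) for c in population)
```

Calling `select_genes(*make_toy_data())` returns the best penalized score found and the corresponding gene mask. Because each fitness evaluation depends only on its own chromosome and the shared data, the list comprehension in `select_genes` is the natural place to distribute work across mappers.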
