Parallelized Classification of Cancer Sub-types from Gene Expression Profiles Using Recursive Gene Selection

Cancer is a chronic disease caused mainly by irregularities in genes, so identifying the oncogenes responsible is important. Biological data such as gene expression levels, protein sequences, RNA sequences, pathway analyses, pan-cancer analyses, and structural biomarkers can aid cancer diagnosis, classification, and prognosis. This research focuses on classifying cancer subtypes from Microarray Gene Expression (MGE) levels. MGE data are high-dimensional with very few samples, so dimensionality reduction is needed to select the relevant genes and remove the redundant ones. The Recursive Feature Selection (RFS) method is proposed, which repeats the gene selection process until the best gene subset is found. The resulting gene subset is then used for classification with different models, evaluated using 10-fold cross-validation. To scale to large volumes of gene expression data, a parallelized classification model was explored on the Spark framework and compared with a non-parallelized classification model on Weka. The results revealed that the parallelized model performs better than the non-parallelized model in terms of both accuracy and execution time. The performance of RFS combined with the parallelized classifier was also compared with previous approaches, and the proposed combination outperformed them.
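The abstract does not specify the scoring criterion or stopping rule used by RFS, so the following is only a minimal Python sketch of the general idea: repeatedly shrink the gene set and keep the subset with the best 10-fold cross-validated accuracy. The signal-to-noise score, the nearest-centroid classifier, the halving schedule, the synthetic data, and all function names are assumptions made for illustration and are not the authors' implementation.

```python
import random
from statistics import mean, pstdev


def make_toy_data(n_samples=40, n_genes=60, n_informative=5, seed=0):
    """Synthetic stand-in for MGE data (assumption): two classes, few informative genes."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = 0 if i < n_samples // 2 else 1
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in range(n_informative):
            row[g] += 3.0 * label  # shift the informative genes for class 1
        X.append(row)
        y.append(label)
    return X, y


def gene_scores(X, y, genes):
    """Signal-to-noise-style score per gene (an assumed filter criterion)."""
    scores = {}
    for g in genes:
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        scores[g] = abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-9)
    return scores


def nearest_centroid_cv(X, y, genes, k=10):
    """k-fold cross-validated accuracy of a nearest-centroid classifier on `genes`."""
    n = len(X)
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        centroids = {}
        for lab in set(y):
            rows = [X[i] for i in train_idx if y[i] == lab]
            centroids[lab] = [mean(r[g] for r in rows) for g in genes]
        for i in test_idx:
            pred = min(
                centroids,
                key=lambda lab: sum(
                    (X[i][g] - c) ** 2 for g, c in zip(genes, centroids[lab])
                ),
            )
            correct += pred == y[i]
    return correct / n


def recursive_gene_selection(X, y, keep_fraction=0.5, min_genes=2, k=10):
    """Halve the gene set repeatedly; stop when cross-validated accuracy drops."""
    genes = list(range(len(X[0])))
    best_genes, best_acc = genes, nearest_centroid_cv(X, y, genes, k)
    while len(genes) > min_genes:
        scores = gene_scores(X, y, genes)
        n_keep = max(min_genes, int(len(genes) * keep_fraction))
        genes = sorted(genes, key=scores.get, reverse=True)[:n_keep]
        acc = nearest_centroid_cv(X, y, genes, k)
        if acc < best_acc:
            break  # the smaller subset is worse, so keep the previous best
        best_genes, best_acc = genes, acc
    return sorted(best_genes), best_acc


if __name__ == "__main__":
    X, y = make_toy_data()
    selected, accuracy = recursive_gene_selection(X, y)
    print(len(selected), "genes kept; 10-fold CV accuracy:", accuracy)
```

For brevity, the sketch scores genes on all samples before cross-validating, which can inflate the accuracy estimate. A more careful evaluation would repeat the selection inside each training fold.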
