Feature Selection and Classification of Microarray Data using MapReduce based ANOVA and K-Nearest Neighbor

Abstract The major drawback of microarray data is the ‘curse of dimensionality’, which obscures the useful information in a dataset and leads to computational instability. Selecting relevant genes is therefore imperative in microarray data analysis. Most existing schemes employ a two-phase process: feature selection/extraction followed by classification. In this paper, a MapReduce-based statistical test, ANOVA, is proposed to select the relevant features. After feature selection, a MapReduce-based K-Nearest Neighbor (K-NN) classifier is proposed to classify the microarray data. These algorithms are successfully implemented on the Hadoop framework, and a comparative analysis is carried out using various datasets.
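
The two-phase pipeline described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' Hadoop implementation: the map and reduce phases are emulated with plain Python functions, and all names (`anova_f`, `map_gene`, `reduce_top_k`, `knn_predict`) and the toy data are hypothetical. The sketch assumes a one-way ANOVA F-statistic computed independently per gene, a Euclidean distance, and a majority vote among the k nearest samples.

```python
import math
from collections import Counter, defaultdict

def anova_f(values, labels):
    """One-way ANOVA F-statistic for one gene (assumes >= 2 classes, n > classes)."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    ss_between = sum(len(g) * (means[y] - grand_mean) ** 2 for y, g in groups.items())
    ss_within = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g)
    if ss_within == 0:
        return math.inf if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def map_gene(gene_id, values, labels):
    """Map phase (emulated): score one gene independently of all others."""
    return gene_id, anova_f(values, labels)

def reduce_top_k(scored, k):
    """Reduce phase (emulated): keep the k genes with the largest F-statistic."""
    return [g for g, _ in sorted(scored, key=lambda p: p[1], reverse=True)[:k]]

def knn_predict(train_rows, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training samples."""
    dists = sorted(
        (math.dist(row, query), y) for row, y in zip(train_rows, train_labels)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): rows are genes, columns are samples.
labels = ["A", "A", "B", "B"]
genes = {
    "g1": [1.0, 1.2, 5.0, 5.2],
    "g2": [3.0, 1.0, 2.9, 1.1],
    "g3": [0.9, 1.1, 4.0, 4.2],
}

scored = [map_gene(g, vals, labels) for g, vals in genes.items()]
selected = reduce_top_k(scored, 2)

# Project each sample onto the selected genes before classification.
train_rows = [[genes[g][i] for g in selected] for i in range(len(labels))]
prediction = knn_predict(train_rows, labels, [1.1, 1.0], k=3)
print(selected, prediction)
```

Because each gene is scored independently, the map step parallelizes naturally across genes; in a distributed setting only the (gene, score) pairs would need to reach the reducer.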

[18]  Vinod Kumar Vavilapalli,et al.  Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 , 2014 .