Framework for Parallel Preprocessing of Microarray Data Using Hadoop

Microarray technology has become a popular way to study gene expression and to diagnose disease. The National Center for Biotechnology Information (NCBI) hosts public databases containing large volumes of biological data that must be preprocessed because they carry high levels of noise and bias. Robust Multiarray Average (RMA) is one of the standard methods used to preprocess such data and remove noise. Most preprocessing algorithms are time-consuming and cannot handle large collections of datasets with thousands of experiments. Parallel processing can address these issues. Hadoop is a well-known framework for distributed storage and processing that provides a parallel environment in which to run the experiments. In this research, for the first time, the capability of Hadoop and the statistical power of R are leveraged to parallelize the RMA preprocessing algorithm so that microarray data can be processed efficiently. The experiment was run on a cluster of 5 nodes, each with 16 cores and 16 GB of memory. It compares the efficiency and performance of RMA parallelized with Hadoop against RMA parallelized with the affyPara package and against sequential RMA. The results show that, in terms of speed-up, the proposed approach outperforms both the sequential approach and the affyPara approach.
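
To make the idea of parallelizing RMA concrete, the following is a minimal sketch of one RMA stage, quantile normalization, written in a map/reduce style. It is an illustration under stated assumptions and not the authors' implementation: the function names, the toy data, and the way arrays are split into partitions are all hypothetical, Python is used here instead of R and Hadoop, ties are broken by position rather than averaged, and the other RMA stages (background correction and median-polish summarization) are omitted.

```python
from functools import reduce


def map_sorted_sum(partition):
    """Map step: sort each array in a partition and sum the sorted values rank by rank."""
    n_probes = len(partition[0])
    total = [0.0] * n_probes
    for array in partition:
        for rank, value in enumerate(sorted(array)):
            total[rank] += value
    # Return the partial sums together with the number of arrays they cover.
    return total, len(partition)


def reduce_sums(a, b):
    """Reduce step: merge two partial results into one."""
    sum_a, n_a = a
    sum_b, n_b = b
    return [x + y for x, y in zip(sum_a, sum_b)], n_a + n_b


def map_apply_reference(partition, reference):
    """Second map step: replace each value by the reference value of its rank."""
    normalized_partition = []
    for array in partition:
        order = sorted(range(len(array)), key=lambda i: array[i])
        normalized = [0.0] * len(array)
        for rank, index in enumerate(order):
            normalized[index] = reference[rank]
        normalized_partition.append(normalized)
    return normalized_partition


def quantile_normalize(partitions):
    # Each map call touches only its own partition, so on a cluster the calls
    # could run on different nodes; here they run one after another.
    partials = [map_sorted_sum(p) for p in partitions]
    total, n_arrays = reduce(reduce_sums, partials)
    reference = [s / n_arrays for s in total]
    return reference, [map_apply_reference(p, reference) for p in partitions]


# Hypothetical toy data: three arrays of three probe intensities in two partitions.
partitions = [
    [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]],
    [[3.0, 9.0, 6.0]],
]
reference, result = quantile_normalize(partitions)
```

After normalization every array holds the same set of values (the rank-wise means across all arrays), arranged according to that array's original ordering. Only the per-partition partial sums travel between the map and reduce steps, which is what makes this stage a natural fit for a distributed environment.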
