Running Genome Wide Data Analysis Using a Parallel Approach on a Cloud Platform
Hierarchical Naive Bayes (HNB) is a multivariate classification algorithm that can be used to predict the probability of a specific disease by analysing a set of Single Nucleotide Polymorphisms (SNPs). In this paper we present an implementation of HNB that uses a parallel approach based on the Map-Reduce paradigm, built natively on the Hadoop framework and running on the Amazon Cloud Infrastructure. We tested our approach on two GWAS datasets aimed at identifying the genetic bases of Type 1 Diabetes (T1D) and Type 2 Diabetes (T2D). Both datasets include individual-level data for 1,900 cases and 1,500 controls, with approximately 420,000 SNPs. For T2D the best results were obtained using the complete set of SNPs, whereas for T1D the best performance was achieved using a small number of SNPs selected through standard univariate association tests. Our cloud-based implementation makes it possible to run genome-wide simulations while cutting down computational time and overall infrastructure costs.
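To illustrate how a Bayesian classifier over SNPs can be split into Map and Reduce steps, here is a minimal single-process sketch in Python. It is not the authors' Hadoop implementation, and it uses a plain Naive Bayes model with Laplace smoothing rather than the hierarchical variant. The function names (`map_individual`, `reduce_counts`, `posterior_case`), the 0/1/2 genotype coding, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from math import exp, log


def map_individual(label, genotypes):
    """Map step (hypothetical): emit one ((snp, genotype, label), 1) pair per SNP."""
    for snp_index, genotype in enumerate(genotypes):
        yield (snp_index, genotype, label), 1


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the emitted values for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def posterior_case(counts, class_sizes, genotypes, n_genotypes=3):
    """Posterior probability of 'case' under plain Naive Bayes.

    Laplace smoothing (add one) is an assumption of this sketch, as is
    the coding of genotypes as 0, 1 or 2 copies of the minor allele.
    """
    total = sum(class_sizes.values())
    log_scores = {}
    for label, size in class_sizes.items():
        score = log(size / total)  # class prior
        for snp_index, genotype in enumerate(genotypes):
            count = counts.get((snp_index, genotype, label), 0)
            score += log((count + 1) / (size + n_genotypes))
        log_scores[label] = score
    # Normalise in log space to avoid underflow with many SNPs.
    top = max(log_scores.values())
    weights = {label: exp(s - top) for label, s in log_scores.items()}
    return weights["case"] / sum(weights.values())


# Toy data (invented for illustration): two cases, two controls, two SNPs.
individuals = [
    ("case", [2, 1]),
    ("case", [2, 0]),
    ("control", [0, 1]),
    ("control", [0, 0]),
]
pairs = [pair for label, g in individuals for pair in map_individual(label, g)]
counts = reduce_counts(pairs)
class_sizes = {"case": 2, "control": 2}
probability = posterior_case(counts, class_sizes, [2, 1])
print(probability)
```

Because each emitted key is independent of the others, the counting step can be distributed across many workers. This is the property that a framework such as Hadoop exploits when the number of SNPs is large.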