Large-Scale Neo-Heterogeneous Programming and Optimization of SNP Detection on Tianhe-2

SNP detection is a fundamental procedure in genome analysis. SOAPsnp, a popular SNP detection tool, can take more than one week to analyze a single human genome at 20-fold coverage. To improve efficiency, we developed mSNP, a parallel version of SOAPsnp that uses CPUs in cooperation with Intel® Xeon Phi™ coprocessors for large-scale SNP detection. First, we redesigned the key data structure of SOAPsnp, which significantly reduces the overhead of memory operations. Second, we devised a coordinated parallel framework in which the CPU collaborates with the Xeon Phi to achieve higher hardware utilization. Third, we proposed a read-based window division strategy to improve throughput and parallel scale across multiple nodes. To the best of our knowledge, mSNP is the first SNP detection tool accelerated by Xeon Phi. We achieved a 45x speedup on a single node of Tianhe-2 without any loss in precision. Moreover, mSNP showed promising scalability on 4,096 nodes of Tianhe-2.
