Scalable Assembly for Massive Genomic Graphs

Scientists increasingly want to assemble large genomes, metagenomes, and large numbers of individual genomes. Parallel genome assembly is vital to meeting the demand for processing these huge datasets. Among parallel genome assemblers, those based on de Bruijn graphs are the most popular. However, the size of a de Bruijn graph is determined by the number of distinct k-mers used in the algorithm, so redundant k-mers in the genome datasets do not contribute to the graph size. The scalability of a genome assembler is therefore influenced directly by the number of distinct k-mers in the dataset, that is, by the de Bruijn graph size, rather than by the input dataset size. To evaluate the assembly of large genomes, we artificially created 16 datasets, 4 terabytes in total, from the human reference genome. The human reference genome is first mutated with a 5% mutation rate and then passed to the genome sequencing data simulator ART. The number of distinct k-mers in the simulated datasets increases linearly with the number of datasets combined. We then evaluate all five time-consuming steps of SWAP-Assembler 2.0 (SWAP2) using these 16 simulated datasets. In contrast to our previous experiment on the 1000 human dataset, which had a fixed de Bruijn graph size, the weak-scaling test shows that SWAP2 scales well from 1024 cores using one dataset to 16,384 cores. The share of time spent in each of the five steps of SWAP2 stays fixed, and the total time usage is also constant. The results show that graph simplification accounts for almost 75% of the total time usage, which makes it a target for optimization in future work.
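The claim that graph size follows distinct k-mers rather than input size can be illustrated with a minimal sketch. The function name `distinct_kmers`, the toy reads, and the choice k = 4 are assumptions made for illustration and are not taken from SWAP2. Real assemblers typically also merge each k-mer with its reverse complement, which this sketch omits.

```python
def distinct_kmers(reads, k):
    """Return the set of distinct k-mers found in an iterable of reads.

    Illustrative only: no reverse-complement canonicalization is applied.
    """
    kmers = set()
    for read in reads:
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


# Hypothetical toy reads, not real sequencing data.
reads = ["ACGTACGT", "CGTACGTA"]

# Ten copies of the same reads: the input is 10x larger, but no new k-mers appear.
duplicated = reads * 10

# One extra read carrying a single substitution introduces new k-mers.
mutated = reads + ["ACGTTCGT"]

base_count = len(distinct_kmers(reads, 4))
duplicated_count = len(distinct_kmers(duplicated, 4))
mutated_count = len(distinct_kmers(mutated, 4))

print(base_count, duplicated_count, mutated_count)
```

In this toy setting, duplicating the reads leaves the distinct k-mer count unchanged, while a mutated read enlarges it. This mirrors why the simulated datasets are built from mutated copies of the reference rather than from repeated copies of the same data.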
