Paper information - SWAP-Assembler 2: Scalable Genome Assembler towards Millions of Cores -- Practice and Experience


There is a widening gap between the throughput of massively parallel sequencing machines and the ability to analyze the resulting sequencing data, which can reach terabytes or even petabytes. Our previous assembly tool, SWAP-Assembler, scaled to 2048 cores on TianHe 1A for the human Yanhuang genome. This work aims to scale SWAP-Assembler further, towards millions of cores on Mira. SWAP-Assembler can be divided into five steps, of which the most time-consuming are input parallelization, k-mer graph construction, and graph simplification (edge merging). We optimize these three steps so that the percentage of time spent in each remains constant as the number of cores increases. In the input parallelization step, the input data is divided into virtual fragments of almost equal size, and the begin and end positions of each fragment are automatically aligned to the beginning symbol of a read. This data blocking strategy plays a central role in adjusting the data size to maintain communication and memory efficiency in the subsequent steps. In k-mer graph construction, to prevent communication efficiency from degrading, the message size between any two processes is kept constant (about 8 KB) by increasing the number of nucleotides processed in each round of the input parallelization step in proportion to the number of processes. Memory usage also benefits, since only a small part of the input data is processed in each round. In graph simplification, the major improvement is to combine message sending and receiving between two neighbors into a single loop in the communication protocol. With these optimizations integrated, the new assembly tool is denoted SWAP-Assembler 2, or SWAP2 for short. In our experiment on the 1k human genome dataset, SWAP-Assembler 2 scales to 16k cores with a parallel efficiency of 70%.
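
The data blocking and constant-message-size ideas described above can be illustrated with a minimal Python sketch. It is not the SWAP2 implementation: the function names, the FASTA-style `>` read-start marker, and the 8-bytes-per-k-mer encoding are all assumptions made for illustration, and the sizing rule assumes k-mers are spread evenly across processes.

```python
def fragment_bounds(data: bytes, num_fragments: int, marker: bytes = b">"):
    """Split `data` into nearly equal byte ranges that each begin at a read start.

    Hypothetical helper: a read is assumed to start at a line beginning
    with `marker` (FASTA-style), which the abstract does not specify.
    """
    size = len(data)
    cuts = [0]
    for i in range(1, num_fragments):
        # Nominal equal-size cut, never behind the previous cut.
        pos = max(i * size // num_fragments, cuts[-1])
        # Slide forward until the cut sits on a marker at the start of a line.
        while pos < size and not (
            data[pos:pos + 1] == marker
            and (pos == 0 or data[pos - 1:pos] == b"\n")
        ):
            pos += 1
        cuts.append(pos)
    cuts.append(size)
    return [(cuts[i], cuts[i + 1]) for i in range(num_fragments)]


def kmers_per_round(num_procs: int, msg_bytes: int = 8192, bytes_per_kmer: int = 8) -> int:
    """Per-process work per round that keeps each pairwise message near `msg_bytes`.

    Assumption: k-mers are distributed evenly, so each process sends about
    1/num_procs of its per-round k-mers to every other process.
    """
    return (msg_bytes // bytes_per_kmer) * num_procs


if __name__ == "__main__":
    reads = b">r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n>r4\nCCGG\n"
    for begin, end in fragment_bounds(reads, 3):
        print(begin, end, reads[begin:end])
    print(kmers_per_round(1024))
```

Because the per-round workload grows linearly with the process count, the data exchanged between any one pair of processes stays fixed, which is the property the abstract attributes to the k-mer graph construction step.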

Pavan Balaji | Sangmin Seo | Yanjie Wei | Jintao Meng
