Using Apache Spark on genome assembly for scalable overlap-graph reduction

De novo genome assembly builds the genome of a specimen from the overlaps between genomic fragments, without relying on a reference sequence. The sequence fragments (called reads) are assembled into contigs and scaffolds according to their overlaps. The quality of a de novo assembly depends on the length and continuity of the assembled sequences. To enable faster and more accurate assembly, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been developed. However, assembling their output requires a large amount of computer memory when very large overlap graphs are resolved, and the graph resolution is difficult to parallelize. To address these limitations, we propose an algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction using Apache Spark. Its implementations are designed to run de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently reduces the number of edges along the paths of enormous graphs by adapting the scalable features of the graph processing libraries available for Apache Spark, GraphX and GraphFrames. The algorithms and the experimental results are available at our project website, https://github.com/BioHPC/SORA. We evaluated SORA on human genome samples. First, it processed a graph of nearly one billion edges on a distributed cloud cluster. Second, it processed small and mid-size graphs on a single workstation within a short time frame. Overall, SORA achieved linear scaling as the number of computing instances increased.
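To make the idea of overlap-graph reduction concrete, the following is a minimal single-machine sketch in plain Python. It is not SORA's implementation and does not use Spark, GraphX, or GraphFrames. It assumes a directed overlap graph given as `(read, read)` pairs and ignores overlap lengths and read orientation. The function names `transitive_reduction` and `compact_paths` are hypothetical and chosen only for illustration. The first function applies the simple triangle rule for transitive edges, and the second collapses unbranched chains of reads into single paths.

```python
from collections import defaultdict


def transitive_reduction(edges):
    """Drop each edge (u, w) that is bridged by some v with (u, v) and (v, w).

    Simplified triangle rule: overlap lengths are ignored (an assumption).
    """
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    kept = set()
    for u, w in edges:
        # (u, w) is redundant if an intermediate read v links u to w.
        bridged = any(
            w in succ.get(v, ()) for v in succ[u] if v != u and v != w
        )
        if not bridged:
            kept.add((u, w))
    return kept


def compact_paths(edges):
    """Collapse chains of nodes with in-degree 1 and out-degree 1.

    Returns a list of node tuples, one per compacted path.
    Cycles made only of such chain nodes are skipped in this sketch.
    """
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    nodes = set(succ) | set(pred)

    def is_inner(n):
        return len(pred[n]) == 1 and len(succ[n]) == 1

    paths = []
    for u in sorted(nodes):
        if is_inner(u):
            continue  # paths start only at branching or terminal nodes
        for v in sorted(succ[u]):
            path = [u, v]
            # Extend through unbranched nodes until a branch or an end.
            while is_inner(path[-1]):
                path.append(next(iter(succ[path[-1]])))
            paths.append(tuple(path))
    return paths


if __name__ == "__main__":
    # Hypothetical toy graph: A->C is implied by A->B and B->C.
    overlaps = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
    reduced = transitive_reduction(overlaps)
    print(sorted(reduced))         # [('A', 'B'), ('B', 'C'), ('C', 'D')]
    print(compact_paths(reduced))  # [('A', 'B', 'C', 'D')]
```

In a distributed setting, the same two steps would have to be expressed as operations over partitioned vertex and edge collections, which is the kind of workload the Spark graph libraries named above are designed for.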
