SpaRC: Scalable Sequence Clustering using Apache Spark

Motivation: Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100-1000 GB of sequence data derived from tens of thousands of different genes or microbial species. Assembly of these data sets requires tradeoffs between scalability and accuracy. Current assembly methods optimized for scalability often sacrifice accuracy and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes.

Results: Here we describe an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC achieves high clustering performance on transcriptomes and metagenomes from both short and long read sequencing technologies. It scales near-linearly with input data size and number of compute nodes. SpaRC can run on both cloud computing and HPC environments without modification while delivering similar performance. Our results demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems.

Availability and implementation: https://bitbucket.org/berkeleylab/jgi-sparc
