Rapid parallel genome indexing with MapReduce

Sequence alignment is one of the most important applications in computational biology, and is used for such diverse tasks as identifying homologous proteins, analyzing gene expression, mapping variations between individuals, and assembling the genome of an organism de novo. Except for trivial cases involving just a small number of short sequences, virtually all sequence alignment tasks rely on a precomputed index of the sequence to accelerate the alignment. Two of the most important index structures are the suffix array, which consists of the lexicographically sorted list of suffixes of a genome, and the closely related Burrows-Wheeler Transform (BWT), which is a permutation of the genome based on the suffix array. Constructing these structures on large sequences, such as the human genome, requires several hours of serial computation and must be performed for each genome, or genome assembly, to be analyzed. Here we present a novel parallel algorithm for constructing the suffix array and the BWT of a sequence, leveraging the unique features of the MapReduce parallel programming model. We demonstrate the performance of the algorithm by greatly accelerating suffix array and BWT construction on five significant genomes using as many as 120 cores leased from the Amazon Elastic Compute Cloud (EC2), reducing the end-to-end runtime from hours to mere minutes. The source code is available under an open-source GPL license at: http://code.google.com/p/genome-indexing/
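
To make the two index structures concrete, the following is a minimal serial sketch in Python of a suffix array and the BWT derived from it. It is not the parallel MapReduce algorithm presented here; it is a naive comparison sort meant only to illustrate the definitions on a toy sequence. The function names and the use of "$" as an end-of-sequence sentinel (assumed to sort before every nucleotide) are illustrative assumptions, not taken from the released source code.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive approach: sort positions by the suffix beginning at each one.
    Suitable for toy inputs only; it is far too slow for whole genomes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """Return the BWT: the character preceding each suffix, in sorted order.

    For the suffix starting at position 0, text[-1] wraps around to the
    final character, which is the sentinel.
    """
    return "".join(text[i - 1] for i in sa)


# Toy sequence with the assumed "$" sentinel appended.
genome = "GATTACA$"
sa = suffix_array(genome)
bwt = bwt_from_suffix_array(genome, sa)

print(sa)   # [7, 6, 4, 1, 5, 0, 3, 2]
print(bwt)  # ACTGA$TA
```

The BWT here is read directly off the suffix array, which is why the two structures are usually built together: once the suffixes are sorted, the transform costs one extra pass over the sequence.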
