A Framework for Scalable Genome Assembly on Clusters, Clouds, and Grids

Bioinformatics researchers need efficient means to process large collections of genomic sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces the Scalable Assembler at Notre Dame (SAND) framework that can achieve significant speedup using large numbers of commodity machines harnessed from clusters, clouds, and grids. SAND interfaces with the Celera open-source assembly toolkit, replacing two independent sequential modules with scalable parallel alternatives: the candidate selector exploits distributed memory capacity, and the sequence aligner exploits distributed computing capacity. For large problems, these modules provide robust task and data management while also achieving speedup with high efficiency. We show results for several data sets ranging from 738 thousand to over 320 million alignments using resources ranging from a small cluster to more than a thousand nodes spanning three institutions.
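The two-stage structure described above, candidate selection followed by alignment of independent pairs, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from SAND or Celera: the shared k-mer filter, the exact suffix-prefix overlap used as a toy stand-in for alignment, the function names `select_candidates`, `overlap_length`, and `align_all`, and the local thread pool standing in for remote workers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def select_candidates(reads, k=4):
    """Return sorted pairs of read ids sharing at least one k-mer (assumed filter)."""
    index = defaultdict(set)  # k-mer -> ids of the reads that contain it
    for rid, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


def overlap_length(a, b):
    """Longest suffix of a that exactly matches a prefix of b (toy alignment)."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def align_all(reads, pairs, workers=4):
    """Score every candidate pair as an independent task (hypothetical helper)."""
    def task(pair):
        x, y = pair
        score = max(overlap_length(reads[x], reads[y]),
                    overlap_length(reads[y], reads[x]))
        return pair, score

    # Threads stand in for the remote workers a real deployment would use.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, pairs))


reads = {"r1": "ACGTACGGA", "r2": "CGGATTCA", "r3": "TTTTTTTT"}
candidates = select_candidates(reads)
scores = align_all(reads, candidates)
```

Because each pair is scored independently, the alignment stage is a bag of tasks that can be spread over as many workers as are available, while the k-mer index is the memory-heavy part that a distributed candidate selector would partition across machines.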
