A Hadoop-Galaxy adapter for user-friendly and scalable data-intensive bioinformatics in Galaxy

We present a strategy for integrating Hadoop-based applications into the Galaxy platform, together with an extensible implementation of that strategy, the Hadoop-Galaxy adapter, and related utilities. The strategy introduces a new Galaxy datatype that acts as a layer of indirection: the Galaxy dataset holds references to the data rather than the data itself. This relaxes the requirement that data reside on a Galaxy-accessible file system and allows the referenced data to be placed in any addressable storage, including the Hadoop Distributed File System (HDFS) or Amazon S3. With the adapter, Hadoop-based applications can be used as steps in Galaxy workflows. We demonstrate a practical application at CRS4, where the Hadoop-Galaxy adapter was used to accelerate the bioinformatics analysis of viral vector integration sites by introducing Hadoop-based computation components, while keeping the workflow under the control of biologists with little specific technical training.
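The indirection idea can be illustrated with a minimal sketch. It assumes a plain-text reference file holding one URI per line; the class name PathSet, its methods, and the example URIs are hypothetical and are not taken from the adapter's actual datatype or file format. Galaxy would store only the small reference file, while the data it points to stays on HDFS, S3, or another addressable store.

```python
from dataclasses import dataclass, field
from typing import List
from urllib.parse import urlparse


@dataclass
class PathSet:
    """Hypothetical reference file: a list of URIs pointing at data stored elsewhere."""

    uris: List[str] = field(default_factory=list)

    def add(self, uri: str) -> None:
        # Require an explicit scheme (hdfs, s3, file, ...) so each entry is
        # addressable without assuming a Galaxy-local file system.
        if not urlparse(uri).scheme:
            raise ValueError(f"URI must include a scheme: {uri!r}")
        self.uris.append(uri)

    def dumps(self) -> str:
        # Serialize as one URI per line; this text is what Galaxy would store.
        return "\n".join(self.uris) + "\n"

    @classmethod
    def loads(cls, text: str) -> "PathSet":
        # Rebuild the reference set from stored text, skipping blank lines.
        path_set = cls()
        for line in text.splitlines():
            line = line.strip()
            if line:
                path_set.add(line)
        return path_set

    def schemes(self) -> List[str]:
        # Report which storage systems the referenced data lives on.
        return sorted({urlparse(uri).scheme for uri in self.uris})


if __name__ == "__main__":
    refs = PathSet()
    refs.add("hdfs://namenode:8020/data/reads/part-00000")  # example URI, assumed
    refs.add("s3://example-bucket/reads/part-00001")        # example URI, assumed
    print(refs.dumps(), end="")
    print(refs.schemes())
```

In this sketch a Hadoop-based tool would read the reference file, resolve the URIs itself, and write a new reference file as its output, so that large data never has to pass through Galaxy's own storage.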
