Paper information - A Hadoop-Galaxy adapter for user-friendly and scalable data-intensive bioinformatics in Galaxy

In this work we present a strategy for integrating Hadoop-based applications into the Galaxy platform, along with an extensible implementation of the resulting Hadoop-Galaxy adapter and related utilities. The strategy introduces a new Galaxy datatype that provides a layer of indirection. This relaxes the requirement that data reside on a Galaxy-accessible file system and allows the referenced data to be placed in any addressable space, including the Hadoop Distributed File System or Amazon S3. The adapter supports the use of Hadoop-based applications as part of Galaxy workflows. We demonstrate a practical application in which the Hadoop-Galaxy adapter was used at CRS4 to accelerate the bioinformatics analysis of viral vector integration sites by introducing Hadoop-based computation components, while keeping the workflow under the control of biologists with little specific technical training.
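
The indirection idea in the abstract can be illustrated with a minimal Python sketch. This is not the adapter's actual implementation: the class name `PathReference`, its methods, the one-URI-per-line file layout, and the example URIs are all hypothetical and chosen only for illustration. The sketch assumes that the dataset Galaxy tracks is a small text file listing URIs, while the data itself stays wherever those URIs point.

```python
from dataclasses import dataclass, field
from typing import List
from urllib.parse import urlparse


@dataclass
class PathReference:
    """Hypothetical indirection dataset: stores URIs, not the data itself."""

    uris: List[str] = field(default_factory=list)

    def dumps(self) -> str:
        # Serialize as one URI per line; this small file is what the
        # workflow platform would store and pass between tools.
        return "\n".join(self.uris) + "\n"

    @classmethod
    def loads(cls, text: str) -> "PathReference":
        # Parse the reference file back into a list of URIs,
        # skipping blank lines.
        return cls([line.strip() for line in text.splitlines() if line.strip()])

    def schemes(self) -> List[str]:
        # Report which storage systems the referenced data lives on.
        # URIs without a scheme are treated as local files (an assumption).
        return sorted({urlparse(u).scheme or "file" for u in self.uris})


if __name__ == "__main__":
    # Example URIs are made up for illustration.
    ref = PathReference([
        "hdfs://namenode:8020/user/galaxy/reads/part-00000",
        "s3://example-bucket/run42/variants.vcf",
    ])
    print(ref.dumps(), end="")
    print(ref.schemes())
```

Under these assumptions, a Hadoop-based tool would read the reference file, resolve the URIs itself, and write a new reference file for its output, so the bulk data never has to pass through a Galaxy-accessible file system.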
