Scaling up production in medium and large high-throughput sequencing facilities presents a number of challenges. As the rate at which samples must be processed increases, manually performing and tracking the center’s operations becomes increasingly difficult, costly, and error-prone, while processing the massive amounts of data poses significant computational challenges. We present our ongoing work to automate and track all data-related procedures at the CRS4 Sequencing and Genotyping Platform, while integrating state-of-the-art processing technologies such as Hadoop, OMERO, iRODS, and Galaxy into our automated workflows. The core system is currently in its testing phase and is on schedule to enter production use at CRS4 by May 2013. The results obtained thus far are encouraging, and we are confident that this system will increase the efficiency and capacity of the CRS4 Platform. In the near future, the integration components will be released as open source software.