An automated infrastructure to support high-throughput bioinformatics

The number of domains affected by the big data phenomenon is constantly increasing, in both science and industry, and high-throughput DNA sequencers are among the most prolific data producers. Building analysis frameworks that can keep up with such a high production rate, however, is only part of the problem. Current challenges include dealing with richly structured data repositories in which objects are connected by multiple relationships; managing complex processing pipelines in which each step depends on a large number of configuration parameters; and ensuring reproducibility, error control, and usability by non-technical staff. Here we describe an automated infrastructure built to address these issues in the analysis of data produced by the CRS4 next-generation sequencing facility. The system integrates open source tools, either written by us or publicly available, into a framework that handles the whole data transformation process, from raw sequencer output to primary analysis results.
