Recommendations on e-infrastructures for next-generation sequencing

With ever-increasing amounts of data being produced by next-generation sequencing (NGS) experiments, the requirements placed on supporting e-infrastructures have grown. In this work, we provide recommendations based on the collective experiences from participants in the EU COST Action SeqAhead for the tasks of data preprocessing, upstream processing, data delivery, and downstream analysis, as well as long-term storage and archiving. We cover demands on computational and storage resources, networks, software stacks, automation of analysis, education, and also discuss emerging trends in the field. E-infrastructures for NGS require substantial effort to set up and maintain over time, and with sequencing technologies and best practices for data analysis evolving rapidly it is important to prioritize both processing capacity and e-infrastructure flexibility when making strategic decisions to support the data analysis demands of tomorrow. Due to increasingly demanding technical requirements we recommend that e-infrastructure development and maintenance be handled by a professional service unit, be it internal or external to the organization, and emphasis should be placed on collaboration between researchers and IT professionals.

[1]  Mauricio O. Carneiro,et al.  From FastQ Data to High‐Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline , 2013, Current protocols in bioinformatics.

[2]  Zhen Ji,et al.  High-throughput DNA sequence data compression , 2015, Briefings Bioinform..

[3]  Ola Spjuth,et al.  Experiences with workflows for automating data-intensive bioinformatics , 2015, Biology Direct.

[4]  Daniel J. Blankenberg,et al.  Galaxy: a platform for interactive large-scale genome analysis. , 2005, Genome research.

[5]  Raffaele Giancarlo,et al.  Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies , 2014, Briefings Bioinform..

[6]  J. Lupski,et al.  The complete genome of an individual by massively parallel DNA sequencing , 2008, Nature.

[7]  Mikko Koski,et al.  Chipster: user-friendly analysis software for microarray and other high-throughput data , 2011, BMC Genomics.

[8]  Ola Spjuth,et al.  A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data , 2015, GigaScience.

[9]  Markus Hsi-Yang Fritz,et al.  Efficient storage of high throughput DNA sequencing data using reference-based compression. , 2011, Genome research.

[10]  B. Segal,et al.  Grid computing: the European Data Grid Project , 2000, 2000 IEEE Nuclear Science Symposium. Conference Record (Cat. No.00CH37149).

[11]  Gianluigi Zanetti,et al.  SEAL: a distributed short read mapping and duplicate removal tool , 2011, Bioinform..

[12]  Ola Spjuth,et al.  BioImg.org: A Catalog of Virtual Machine Images for the Life Sciences , 2015, Bioinformatics and biology insights.

[13]  Magalie S Leduc,et al.  Clinical whole-exome sequencing for the diagnosis of mendelian disorders. , 2013, The New England journal of medicine.

[14]  L. Stein The case for cloud computing in genome informatics , 2010, Genome Biology.

[15]  Steven J. M. Jones,et al.  Abyss: a Parallel Assembler for Short Read Sequence Data Material Supplemental Open Access , 2022 .

[16]  Douglas G. Scofield,et al.  The Norway spruce genome sequence and conifer genome evolution , 2013, Nature.

[17]  Steven Tuecke,et al.  GridFTP: Protocol Extensions to FTP for the Grid , 2001 .

[18]  Ola Spjuth,et al.  Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data , 2013, GigaScience.

[19]  Ewan Birney,et al.  The future of DNA sequence archiving , 2012, GigaScience.

[20]  David A. Patterson,et al.  ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing , 2013 .

[21]  Eija Korpelainen,et al.  Hadoop-BAM: directly manipulating next generation sequencing data in the cloud , 2012, Bioinform..

[22]  Monya Baker,et al.  Next-generation sequencing: adjusting to data overload , 2010, Nature Methods.

[23]  Paolo Romano,et al.  Automation of in-silico data analysis processes through workflow management systems , 2007, Briefings Bioinform..

[24]  Jennifer Harris,et al.  Genomic cloud computing: legal and ethical points to consider , 2014, European Journal of Human Genetics.

[25]  M. Metzker Sequencing technologies — the next generation , 2010, Nature Reviews Genetics.

[26]  D. A. Scannicchio ATLAS Trigger and Data Acquisition: Capabilities and commissioning , 2010 .

[27]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.