To the Editor — The standardization, portability and reproducibility of analysis pipelines are key issues within the bioinformatics community. Most bioinformatics pipelines are designed for use on-premises; as a result, the associated software dependencies and execution logic are likely to be tightly coupled with proprietary computing environments. This can make it difficult or even impossible for others to reproduce the ensuing results, which is a fundamental requirement for the validation of scientific findings. Here, we introduce the nf-core framework as a means for the development of collaborative, peer-reviewed, best-practice analysis pipelines (Fig. 1). All nf-core pipelines are written in Nextflow and so inherit the ability to be executed on most computational infrastructures, as well as having native support for container technologies such as Docker and Singularity. The nf-core community (Supplementary Fig. 1) has developed a suite of tools that automate pipeline creation, testing, deployment and synchronization. Our goal is to provide a framework for high-quality bioinformatics pipelines that can be used across all institutions and research facilities.

Being able to reproduce scientific results is the central tenet of the scientific method. However, moving toward FAIR (findable, accessible, interoperable and reusable) research methods1 in data-driven science is complex2,3. Central repositories, such as bio.tools4, omictools5 and the Galaxy toolshed6, make it possible to find existing pipelines and their associated tools. However, it is still notoriously challenging to develop analysis pipelines that are fully reproducible and interoperable across multiple systems and institutions, primarily because of differences in hardware, operating systems and software versions. Although the recommended guidelines for some analysis pipelines have become standardized (for example, GATK best practices7), the actual implementations are usually developed on a case-by-case basis. As such, there is often little incentive to test, document and implement pipelines in a way that permits their reuse by other researchers. This can hamper the sustainable sharing of data and tools, and results in a proliferation of heterogeneous analysis pipelines, making it difficult for newcomers to find what they need to address a specific analysis question.

As the scale of -omics data and their associated analytical tools has grown, the scientific community is increasingly moving toward the use of specialized workflow management systems to build analysis pipelines8. These systems separate the requirements of the underlying compute infrastructure from the analysis and workflow description, introducing a higher degree of portability compared with custom in-house scripts. One such popular tool is Nextflow9. Using Nextflow, software packages can be bundled with analysis pipelines using built-in integration for package managers, such as Conda, and containerization platforms, such as Docker and Singularity. Moreover, support for most common high-performance-computing batch schedulers and cloud providers allows simple deployment of analysis pipelines on almost any infrastructure. The opportunity to run pipelines locally during initial development and then to proceed seamlessly to large-scale computational resources in high-performance-computing or cloud settings provides users and developers with great flexibility. The nf-core community project collects a curated set of best-practice analysis pipelines built using Nextflow.
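To make the bundling of software with analysis steps concrete, the sketch below shows a single Nextflow process in which dependencies are declared next to the step itself. The tool, version pin and image tag are illustrative examples rather than being taken from any particular nf-core pipeline; with both a conda and a container directive in place, the same step can resolve its software through Conda, Docker or Singularity, depending on how the pipeline is launched.

```nextflow
// Illustrative sketch: a self-contained pipeline step whose software is
// declared alongside its logic. The version pin and image tag are examples only.
process FASTQC {
    conda 'bioconda::fastqc=0.11.9'
    container 'quay.io/biocontainers/fastqc:0.11.9--0'

    input:
    path reads

    output:
    path '*_fastqc.{html,zip}'

    script:
    """
    fastqc $reads
    """
}

workflow {
    // params.reads would be supplied on the command line, e.g. --reads '*.fastq.gz'
    FASTQC(Channel.fromPath(params.reads))
}
```

The Conda or container route is then chosen at launch time, for example with a command along the lines of nextflow run nf-core/rnaseq -profile docker; the pipeline name, profile and parameter are given here only as typical usage patterns.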
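Portability across infrastructures is likewise handled through configuration profiles rather than changes to the pipeline code. The fragment below is a minimal, illustrative nextflow.config; the profile names, queue and scheduler are placeholders that would be adapted to a given site.

```nextflow
// Illustrative configuration profiles; names, queue and executor are examples.
profiles {
    docker {
        docker.enabled = true          // run each process in a Docker container
    }
    singularity {
        singularity.enabled = true     // use Singularity images instead
    }
    cluster {
        process.executor = 'slurm'     // submit jobs to a SLURM scheduler
        process.queue    = 'long'      // hypothetical queue name
    }
}
```

Selecting, say, -profile docker on a laptop during development and -profile singularity,cluster on an HPC system then runs the same pipeline unchanged, illustrating the portability described above.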
[1] Benedict Paten et al. The Dockstore: enabling modular, community-focused sharing of Docker-based genomics tools and workflows. F1000Research (2017).
[2] Ian M. Mitchell et al. Best Practices for Scientific Computing. PLoS Biology (2012).
[3] M. Baker. 1,500 scientists lift the lid on reproducibility. Nature (2016).
[4] Silvio C. E. Tosatto et al. Tools and data services registry: a community effort to document bioinformatics resources. Nucleic Acids Research (2015).
[5] Rolf Backofen et al. Practical computational reproducibility in the life sciences. bioRxiv (2017).
[6] Renan Valieris et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods (2018).
[7] Paolo Di Tommaso et al. Nextflow enables reproducible computational workflows. Nature Biotechnology (2017).
[8] Anton Nekrutenko et al. Dissemination of scientific software with Galaxy ToolShed. Genome Biology (2014).
[9] J. Michael Cherry et al. ENCODE data at the ENCODE portal. Nucleic Acids Research (2015).
[10] Tim Smith et al. Making code citable with Zenodo and GitHub (2015).
[11] Jeremy Leipzig et al. A review of bioinformatic pipeline frameworks. Briefings in Bioinformatics (2016).
[12] Massimiliano Izzo et al. FAIRsharing as a community approach to standards, repositories and policies. Nature Biotechnology (2019).
[13] Harald Barsnes et al. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics (2017).
[14] Mauricio O. Carneiro et al. From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Current Protocols in Bioinformatics (2013).
[15] Stian Soiland-Reyes et al. Robust Cross-Platform Workflows: How Technical and Scientific Communities Collaborate to Develop, Test and Share Best Practices for Data Analysis. Data Science and Engineering (2017).
[16] Jeffrey M. Perkel et al. A toolkit for data transparency takes shape. Nature (2018).
[17] Vincent J. Henry et al. OMICtools: an informative directory for multi-omic data analysis. Database (2014).