GenPipes: an open-source framework for distributed and scalable genomic analyses

With the decreasing cost of sequencing and the rapid development of genomics technologies and protocols, the need for validated bioinformatics software that enables efficient large-scale data processing is growing. Here we present GenPipes, a flexible Python-based framework that facilitates the development and deployment of multi-step workflows optimized for high-performance computing clusters and the cloud. GenPipes already implements 12 validated and scalable pipelines for various genomics applications, including RNA-Seq, ChIP-Seq, DNA-Seq, Methyl-Seq, Hi-C, capture Hi-C, metagenomics and PacBio long-read assembly. The software is available under a GPLv3 open-source license and is continuously updated to follow recent advances in genomics and bioinformatics. The framework has already been configured on several servers, and a Docker image is available to facilitate additional installations. In summary, GenPipes offers genomic researchers a simple method to analyze different types of data, customizable to their needs and resources, as well as the flexibility to create their own workflows.
