Bioinformatic pipelines in Python with Leaf

BackgroundAn incremental, loosely planned development approach is often used in bioinformatic studies when dealing with custom data analysis in a rapidly changing environment. Unfortunately, the lack of a rigorous software structuring can undermine the maintainability, communicability and replicability of the process. To ameliorate this problem we propose the Leaf system, the aim of which is to seamlessly introduce the pipeline formality on top of a dynamical development process with minimum overhead for the programmer, thus providing a simple layer of software structuring.ResultsLeaf includes a formal language for the definition of pipelines with code that can be transparently inserted into the user’s Python code. Its syntax is designed to visually highlight dependencies in the pipeline structure it defines. While encouraging the developer to think in terms of bioinformatic pipelines, Leaf supports a number of automated features including data and session persistence, consistency checks between steps of the analysis, processing optimization and publication of the analytic protocol in the form of a hypertext.ConclusionsLeaf offers a powerful balance between plan-driven and change-driven development environments in the design, management and communication of bioinformatic pipelines. Its unique features make it a valuable alternative to other related tools.

[1]  Juliana Freire,et al.  Provenance and scientific workflows: challenges and opportunities , 2008, SIGMOD Conference.

[2]  Bertram Ludäscher,et al.  Kepler: an extensible system for design and execution of scientific workflows , 2004, Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004..

[3]  M F Sanner,et al.  Python: a programming language for software integration and development. , 1999, Journal of molecular graphics & modelling.

[4]  Sanjeev Khanna,et al.  Why and Where: A Characterization of Data Provenance , 2001, ICDT.

[5]  Bernd Bruegge,et al.  Object-Oriented Software Engineering: Using UML, Patterns and Java, Second Edition , 2003 .

[6]  Ronald L. Rivest,et al.  Introduction to Algorithms, 3rd Edition , 2009 .

[7]  Mathieu Fourment,et al.  A comparison of common programming languages used in bioinformatics , 2008, BMC Bioinformatics.

[8]  J. Lupski,et al.  Mechanisms of change in gene copy number , 2009, Nature Reviews Genetics.

[9]  James Cheney,et al.  Provenance as Dependency Analysis , 2007, DBPL.

[10]  Burkhard Linke,et al.  Conveyor: a workflow engine for bioinformatics analyses , 2011, Bioinform..

[11]  Chun Zhang,et al.  Storing and querying ordered XML using a relational database system , 2002, SIGMOD '02.

[12]  Darrel C. Ince,et al.  The case for open computer programs , 2012, Nature.

[13]  A. Cockburn,et al.  Agile Software Development: The People Factor , 2001, Computer.

[14]  Tao Xu,et al.  Pegasys: software for executing and integrating analyses of biological sequences , 2004, BMC Bioinformatics.

[15]  Leo Goodstadt,et al.  Ruffus: a lightweight Python library for computational pipelines , 2010, Bioinform..

[16]  K. Ovaska,et al.  Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme , 2010, Genome Medicine.

[17]  Paolo Romano,et al.  Automation of in-silico data analysis processes through workflow management systems , 2007, Briefings Bioinform..

[18]  Bernd Bruegge,et al.  Object-Oriented Software Engineering Using UML, Patterns, and Java , 2009 .

[19]  Joseph T. Glessner,et al.  PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. , 2007, Genome research.

[20]  Laurence Tratt,et al.  Dynamically Typed Languages , 2009, Adv. Comput..

[21]  Bartek Wilczynski,et al.  Biopython: freely available Python tools for computational molecular biology and bioinformatics , 2009, Bioinform..

[22]  Shawn Hoon,et al.  Biopipe: a flexible framework for protocol-based bioinformatics analysis. , 2003, Genome research.

[23]  A. Nekrutenko,et al.  Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences , 2010, Genome Biology.

[24]  Carole A. Goble,et al.  Taverna: a tool for building and running workflows of services , 2006, Nucleic Acids Res..

[25]  Bernard J. Pope,et al.  Bpipe: a tool for running and managing bioinformatics pipelines , 2012, Bioinform..