rstoolbox: management and analysis of computationally designed structural ensembles

Motivation: Computational protein design (CPD) calculations rely on generating large amounts of data in the search for the best sequences. CPD workflows therefore generally include the batch generation of designed decoys (sampling), followed by ranking and filtering stages that select the decoys with optimal metrics (scoring). Consequently, proper analysis of the decoy population is a key element in the effective selection of designs for experimental validation.

Results: Here, we present rstoolbox, a Python library for the analysis of protein design ensembles. The library is aimed at protein designers with basic coding training who want to process their decoy sets efficiently, as well as at protocol developers interested in benchmarking their new approaches. Although initially devised to process Rosetta design outputs, the library can be extended to other design tools.

Availability and Implementation: rstoolbox is implemented for Python 2.7 and 3.5+. Code is freely available at https://github.com/lpdi-epfl/rstoolbox under the MIT license. Full documentation and examples can be found at https://lpdi-epfl.github.io/rstoolbox.
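As a rough illustration of the sampling-then-scoring workflow described above, the following minimal sketch filters and ranks a small decoy table using plain pandas. It does not use the rstoolbox API; the column names (`description`, `total_score`, `rmsd`), the values, the cutoff, and the helper `select_designs` are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical decoy table: column names and values are illustrative only
# and do not come from the paper or from any real design run.
decoys = pd.DataFrame({
    "description": ["decoy_0001", "decoy_0002", "decoy_0003",
                    "decoy_0004", "decoy_0005"],
    "total_score": [-210.5, -198.2, -225.1, -180.0, -219.7],
    "rmsd": [1.2, 0.8, 2.9, 0.6, 1.0],
})


def select_designs(df, max_rmsd=1.5, top_n=2):
    """Hypothetical helper: filter by a structural cutoff, then rank by score.

    Lower total_score is treated as better, as is usual for energy-like scores.
    """
    # Filtering stage: discard decoys that drift too far from the target.
    filtered = df[df["rmsd"] <= max_rmsd]
    # Ranking stage: keep the top_n decoys with the lowest score.
    return filtered.nsmallest(top_n, "total_score").reset_index(drop=True)


selected = select_designs(decoys)
print(selected)
```

In practice the cutoff and the number of retained decoys depend on the design problem, and several metrics are usually combined rather than a single score.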
