rstoolbox - a Python library for large-scale analysis of computational protein design data and structural bioinformatics

Background: Large-scale datasets of protein structures and sequences are becoming ubiquitous in many domains of biological research. Experimental approaches and computational modelling methods are generating biological data at an unprecedented rate. The detailed analysis of structure-sequence relationships is critical to unveil the governing principles of protein folding, stability and function. Computational protein design (CPD) has emerged as an important structure-based approach to engineer proteins for novel functions. Generally, CPD workflows rely on the generation of large numbers of structural models to search for the optimal structure-sequence configurations. As such, an important step of the CPD process is the selection of a small subset of sequences to be experimentally characterized. Given the limitations of current CPD scoring functions, multi-step design protocols and elaborate analysis of the decoy populations have become essential for the selection of sequences for experimental characterization and for the success of CPD strategies.

Results: Here, we present rstoolbox, a Python library for the analysis of large-scale structural data tailored for CPD applications. rstoolbox is oriented towards both CPD software users and developers, and is easily integrated into analysis workflows. For users, it offers the ability to profile and select decoy sets, either to guide multi-step design protocols or to choose candidates for follow-up experimental characterization. rstoolbox provides intuitive solutions for the visualization of large sequence/structure datasets (e.g. logo plots and heatmaps) and facilitates the analysis of experimental data obtained through traditional biochemical techniques (e.g. circular dichroism and surface plasmon resonance) and high-throughput sequencing. For CPD software developers, it provides a framework to easily benchmark and compare different CPD approaches. Here, we showcase rstoolbox in both types of applications.

Conclusions: rstoolbox is a library for the evaluation of protein structure datasets, tailored for CPD data. It provides interactive access through seamless integration with IPython, while still being suitable for high-performance computing. In addition to its functionalities for data analysis and graphical representation, the inclusion of rstoolbox in protein design pipelines will make it easy to standardize the selection of design candidates and will improve the overall reproducibility and robustness of CPD selection processes.
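The decoy selection step described in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library and does not use the rstoolbox API. The `Decoy` class, the `select_top_decoys` function, the decoy names, sequences and score values are all hypothetical, and the sketch assumes that a lower score indicates a better decoy, which depends on the scoring function in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Decoy:
    """Hypothetical record for one design decoy (not an rstoolbox type)."""
    description: str
    sequence: str
    score: float  # assumption: lower values indicate better decoys


def select_top_decoys(decoys: List[Decoy], n_keep: int) -> List[Decoy]:
    """Return the n_keep best-scoring decoys, best first."""
    if n_keep < 1:
        raise ValueError("n_keep must be at least 1")
    # Sort ascending by score and keep the first n_keep entries.
    return sorted(decoys, key=lambda d: d.score)[:n_keep]


# Illustrative data only; these values are made up for the example.
population = [
    Decoy("decoy_000", "MKTAYIAK", -120.5),
    Decoy("decoy_001", "MKTAYLAK", -98.2),
    Decoy("decoy_002", "MRTAYIAK", -130.1),
    Decoy("decoy_003", "MKSAYIAK", -101.7),
    Decoy("decoy_004", "MKTAFIAK", -125.3),
    Decoy("decoy_005", "MKTAYIGK", -90.0),
]

selected = select_top_decoys(population, n_keep=3)
for decoy in selected:
    print(decoy.description, decoy.sequence, decoy.score)
```

In practice a selection like this would usually combine several score terms and sequence or structural filters rather than rank on a single value; the single-score ranking here is only the simplest case.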
