FRAGSION: ultra-fast protein fragment library generation by IOHMM sampling

MOTIVATION Speed, accuracy and robustness of building protein fragment library have important implications in de novo protein structure prediction since fragment-based methods are one of the most successful approaches in template-free modeling (FM). Majority of the existing fragment detection methods rely on database-driven search strategies to identify candidate fragments, which are inherently time-consuming and often hinder the possibility to locate longer fragments due to the limited sizes of databases. Also, it is difficult to alleviate the effect of noisy sequence-based predicted features such as secondary structures on the quality of fragment. RESULTS Here, we present FRAGSION, a database-free method to efficiently generate protein fragment library by sampling from an Input-Output Hidden Markov Model. FRAGSION offers some unique features compared to existing approaches in that it (i) is lightning-fast, consuming only few seconds of CPU time to generate fragment library for a protein of typical length (300 residues); (ii) can generate dynamic-size fragments of any length (even for the whole protein sequence) and (iii) offers ways to handle noise in predicted secondary structure during fragment sampling. On a FM dataset from the most recent Critical Assessment of Structure Prediction, we demonstrate that FGRAGSION provides advantages over the state-of-the-art fragment picking protocol of ROSETTA suite by speeding up computation by several orders of magnitude while achieving comparable performance in fragment quality. AVAILABILITY AND IMPLEMENTATION Source code and executable versions of FRAGSION for Linux and MacOS is freely available to non-commercial users at http://sysbio.rnet.missouri.edu/FRAGSION/ It is bundled with a manual and example data. CONTACT chengji@missouri.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

[1]  Sandra Macedo-Ribeiro,et al.  Structure of mycobacterial maltokinase, the missing link in the essential GlgE-pathway , 2015, Scientific Reports.

[2]  Daniel W. Kulp,et al.  Generalized Fragment Picking in Rosetta: Design, Protocols and Applications , 2011, PloS one.

[3]  M. Levitt,et al.  Protein decoy assembly using short fragments under geometric constraints , 2003, Biopolymers.

[4]  Jesper Ferkinghoff-Borg,et al.  A generative, probabilistic model of local protein structure , 2008, Proceedings of the National Academy of Sciences.

[5]  C Kooperberg,et al.  Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. , 1997, Journal of molecular biology.

[6]  Anders Krogh,et al.  Sampling Realistic Protein Conformations Using Local Structural Bias , 2006, PLoS Comput. Biol..

[7]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[8]  Yang Zhang,et al.  Scoring function for automated assessment of protein structure template quality , 2004, Proteins.

[9]  Debswapna Bhattacharya,et al.  De novo protein conformational sampling using a probabilistic graphical model , 2015, Scientific Reports.

[10]  K. Mardia,et al.  Protein Bioinformatics and Mixtures of Bivariate von Mises Distributions for Angular Data , 2007, Biometrics.

[11]  Michael Habeck,et al.  HHfrag: HMM-based fragment detection using HHpred , 2011, Bioinform..

[12]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[13]  S. R. Jammalamadaka,et al.  Directional Statistics, I , 2011 .