Statistical Compression of Protein Folding Patterns for Inference of Recurrent Substructural Themes

Computational analyses of the growing corpus of three-dimensional (3D) protein structures have revealed a limited set of recurrent substructural themes, termed super-secondary structures. Knowledge of super-secondary structures is important for the study of protein evolution and for the modeling of proteins with unknown structures. Characterizing a comprehensive dictionary of these super-secondary structures has remained an open computational challenge in protein structural studies. This paper presents an unsupervised method for learning such a dictionary using the statistical framework of lossless compression, applied to a database composed of concise geometric representations of protein 3D folding patterns. The best dictionary is defined as the one that yields the greatest compression of the database, that is, the shortest total encoding length. Here we describe the inference methodology and the statistical models used to estimate the encoding lengths. An interactive website for this dictionary is available at http://lcb.infotech.monash.edu.au/proteinConcepts/scop100/dictionary.html.
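The selection criterion above (prefer the dictionary that compresses the database most) can be illustrated with a toy sketch. It is not the method of this paper: the paper works with geometric representations of folding patterns and its own statistical models, whereas the sketch below assumes a database of short strings over a three-letter alphabet, a greedy longest-match parse, and uniform codes. The names `ALPHABET`, `parse_greedy`, `message_length_bits`, and `best_dictionary` are hypothetical and introduced only for illustration.

```python
import math

# Toy alphabet (helix, strand, coil). An assumption for illustration only.
ALPHABET = "HEC"


def parse_greedy(seq, dictionary):
    """Split seq into tokens, taking the longest matching dictionary entry
    at each position and falling back to a single symbol otherwise."""
    entries = sorted(dictionary, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(seq):
        for entry in entries:
            if seq.startswith(entry, i):
                tokens.append(entry)
                i += len(entry)
                break
        else:
            # No dictionary entry matches here: emit one raw symbol.
            tokens.append(seq[i])
            i += 1
    return tokens


def message_length_bits(database, dictionary):
    """Two-part encoding length in bits: dictionary first, then the data."""
    # Part 1: spell out each entry symbol by symbol plus an end marker,
    # using a uniform code over the alphabet and the marker. One extra
    # marker terminates the dictionary itself.
    bits_per_symbol = math.log2(len(ALPHABET) + 1)
    part1 = sum((len(e) + 1) * bits_per_symbol for e in dictionary)
    part1 += bits_per_symbol
    # Part 2: each token (dictionary entry or raw symbol) is sent with a
    # uniform code over all available tokens. Sequence boundaries are
    # ignored in this simplified sketch.
    bits_per_token = math.log2(len(dictionary) + len(ALPHABET))
    n_tokens = sum(len(parse_greedy(seq, dictionary)) for seq in database)
    return part1 + n_tokens * bits_per_token


def best_dictionary(database, candidates):
    """Return the candidate dictionary with the shortest total message."""
    return min(candidates, key=lambda d: message_length_bits(database, d))


database = ["HHHHCEEEEC" * 2, "HHHHCEEEEC" * 3]
candidates = [
    frozenset(),                    # no dictionary: every symbol sent raw
    frozenset({"HHHH", "EEEE"}),    # two short themes
    frozenset({"HHHHCEEEEC"}),      # one longer recurrent theme
]
best = best_dictionary(database, candidates)
```

In this toy setting the longer recurrent theme is preferred: stating it once costs more bits up front, but every later occurrence is then sent as a single token. The trade-off between the cost of the dictionary and the cost of the data given the dictionary is the point of the sketch; the specific codes are placeholders.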
