Statistical Inference of a canonical dictionary of protein substructural fragments

Proteins are biomolecules of life. They fold into a great variety of three-dimensional (3D) shapes. Underlying these folding patterns are many recurrent structural fragments, or building blocks (analogous to 'LEGO bricks'). This paper reports a statistical inference approach for discovering a comprehensive dictionary of protein structural building blocks from a large corpus of experimentally determined protein structures. Our approach is built on the Bayesian and information-theoretic criterion of minimum message length. To the best of our knowledge, this work is the first systematic and rigorous treatment of this important data mining problem, which arises in the cross-disciplinary area of structural bioinformatics. The quality of the dictionary we find is demonstrated by its explanatory power: any protein within the corpus of known 3D structures can be dissected into successive regions, each assigned to a fragment from this dictionary. This dissection induces a novel one-dimensional representation of three-dimensional protein folding patterns, one suited to the rich repertoire of character-string processing algorithms and hence to rapid identification of the folding patterns of newly determined structures. This paper presents the details of the methodology used to infer the dictionary of building blocks, supported by illustrative examples that demonstrate its effectiveness and utility.
