An Algebro-Topological Description of Protein Domain Structure

The space of possible protein structures appears vast and continuous, and the relationship between the primary, secondary and tertiary structure levels is complex. Protein structure comparison and classification are therefore difficult but important tasks, since structure is a determinant of molecular interaction and function. We introduce a novel mathematical abstraction based on geometric topology to describe protein domain structure. Using the locations of the backbone atoms and the hydrogen bonds, we build a combinatorial object, a so-called fatgraph. The description is discrete yet gives rise to a 2-dimensional mathematical surface. Thus, each protein domain corresponds to a particular mathematical surface with characteristic topological invariants, such as the genus (number of holes) and the number of boundary components. Both invariants are global fatgraph features reflecting the interconnectivity of the domain by hydrogen bonds. We introduce the notion of robust variables, that is, variables that are insensitive to minor changes in the structure or fatgraph, and show that the genus and the number of boundary components are robust. Further, we investigate the distribution of different fatgraph variables and show that only four variables suffice to distinguish different folds. We use local (secondary) and global (tertiary) fatgraph features to describe domain structures and illustrate that they are useful for classification of domains in CATH. In addition, we combine our method with two other methods, thereby using primary, secondary, and tertiary structure information, and show that we can identify a large percentage of new and unclassified structures in CATH.
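
To make the two global invariants concrete, the following is a minimal sketch of how genus and boundary count can be read off a fatgraph. It is not the paper's construction from backbone atoms and hydrogen bonds. It assumes a connected, orientable fatgraph (no twisted edges) encoded by two hypothetical permutations on half-edges: `sigma`, the cyclic order of half-edges around each vertex, and `alpha`, the pairing of half-edges into edges. Under those assumptions the boundary components are the cycles of `sigma` applied after `alpha`, and the genus g follows from the Euler characteristic V - E = 2 - 2g - r, where r is the number of boundary components.

```python
from typing import Dict, List, Tuple


def cycles(perm: Dict[int, int]) -> List[List[int]]:
    """Decompose a permutation (given as a dict) into its cycles."""
    seen = set()
    result = []
    for start in perm:
        if start in seen:
            continue
        cycle = []
        h = start
        while h not in seen:
            seen.add(h)
            cycle.append(h)
            h = perm[h]
        result.append(cycle)
    return result


def surface_invariants(sigma: Dict[int, int],
                       alpha: Dict[int, int]) -> Tuple[int, int]:
    """Return (genus, boundary_count) of a connected, orientable fatgraph.

    sigma: next half-edge in the cyclic order around its vertex (assumed encoding).
    alpha: the other half-edge of the same edge (a fixed-point-free involution).
    """
    n_vertices = len(cycles(sigma))
    n_edges = len(cycles(alpha))
    # Walk along the boundary: cross the edge, then turn to the next half-edge.
    phi = {h: sigma[alpha[h]] for h in sigma}
    n_boundaries = len(cycles(phi))
    # Euler characteristic of a surface with boundary: V - E = 2 - 2g - r.
    twice_genus = 2 - n_vertices + n_edges - n_boundaries
    if twice_genus < 0 or twice_genus % 2 != 0:
        raise ValueError("input is not a connected, orientable fatgraph")
    return twice_genus // 2, n_boundaries


# Toy example: two trivalent vertices joined by three edges (a theta graph).
alpha = {0: 3, 3: 0, 1: 4, 4: 1, 2: 5, 5: 2}
# Opposite cyclic orders at the two vertices: the graph embeds in the plane.
sigma_planar = {0: 1, 1: 2, 2: 0, 3: 5, 5: 4, 4: 3}
# Same cyclic order at both vertices: the same graph now needs a handle.
sigma_twisted = {0: 1, 1: 2, 2: 0, 3: 4, 4: 5, 5: 3}

print(surface_invariants(sigma_planar, alpha))   # (0, 3)
print(surface_invariants(sigma_twisted, alpha))  # (1, 1)
```

The two toy inputs share the same underlying graph and differ only in the cyclic order at one vertex, yet they yield different surfaces. This illustrates why genus and boundary count are global features of the fatgraph rather than properties of the plain graph.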
