Graph-based pattern discovery in protein structures

The rapidly growing body of 3D protein structure data provides new opportunities to study the relation between protein structure and protein function. Local structure patterns in proteins have been the focus of recent efforts to link structural features to protein function. In addition, structure patterns have demonstrated value in applications such as predicting protein-protein interactions, engineering proteins, and designing novel medicines. My thesis introduces graph-based representations of protein structure and new subgraph mining algorithms to identify recurring structure patterns common to a set of proteins. These techniques enable families of proteins exhibiting similar function to be analyzed for structural similarity. Previous approaches to discovering local structure patterns in proteins operate in a pairwise fashion and have prohibitive computational cost when scaled to families of proteins. The graph mining strategy is robust to errors in the structure and to errors in the set of proteins thought to share a function. Two collaborations with domain experts at the UNC School of Pharmacy and the UNC Medical School demonstrate the utility of these techniques. The first predicts the function of several newly characterized protein structures. The second identifies conserved structural features in evolutionarily related proteins.
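To make the idea of a graph-based representation concrete, the following is a minimal sketch, not the thesis's algorithm. It assumes each protein is given as a list of residue labels with 3D coordinates, connects residues closer than a distance cutoff (the 8 Å default is an illustrative assumption), and counts only the smallest possible subgraph pattern, a single labeled edge, across a set of proteins. The function names `contact_graph`, `edge_patterns`, and `frequent_edge_patterns` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import dist


def contact_graph(residues, cutoff=8.0):
    """Return index pairs (i, j) of residues whose coordinates lie within `cutoff`.

    `residues` is a list of (label, (x, y, z)) tuples; the cutoff is an assumption.
    """
    edges = set()
    for (i, (_, pos_i)), (j, (_, pos_j)) in combinations(enumerate(residues), 2):
        if dist(pos_i, pos_j) <= cutoff:
            edges.add((i, j))
    return edges


def edge_patterns(residues, edges):
    """Reduce each edge to an order-independent pair of residue labels."""
    return {tuple(sorted((residues[i][0], residues[j][0]))) for i, j in edges}


def frequent_edge_patterns(proteins, min_support, cutoff=8.0):
    """Count, per labeled edge, how many proteins contain it; keep the frequent ones."""
    support = Counter()
    for residues in proteins:
        # A set is used so each pattern counts at most once per protein.
        support.update(edge_patterns(residues, contact_graph(residues, cutoff)))
    return {pattern: n for pattern, n in support.items() if n >= min_support}


# Toy input with made-up coordinates, for illustration only.
protein_a = [("HIS", (0, 0, 0)), ("ASP", (5, 0, 0)), ("SER", (20, 0, 0))]
protein_b = [("ASP", (0, 0, 0)), ("HIS", (0, 6, 0)), ("GLY", (0, 6, 3))]
print(frequent_edge_patterns([protein_a, protein_b], min_support=2))
```

Because support is counted once per protein over the whole set, the cost grows with the number of proteins rather than with the number of protein pairs, which is the contrast with pairwise comparison drawn above. Mining larger subgraphs would require a dedicated frequent-subgraph algorithm, which this sketch does not attempt.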
