Discovery of Functional Motifs from the Interface Region of Oligomeric Proteins Using Frequent Subgraph Mining

Modeling the interface region of a protein complex paves the way for understanding its dynamics and functionalities. Existing works model the interface region of a complex using different approaches, such as the residue composition at the interface region, the geometry of the interface residues, or the structural alignment of interface regions. These approaches are useful for ranking a set of docked conformations or for building scoring functions for protein-protein docking, but they do not provide a generic and scalable technique for extracting interface patterns that lead to functional motif discovery. In this work, we model the interface region of a protein complex as graphs and extract the interface patterns of the given complex in the form of frequent subgraphs. To achieve this, we develop a scalable algorithm for frequent subgraph mining. We show that a systematic review of the mined subgraphs provides an effective method for discovering the functional motifs that exist along the interface region of a given protein complex. In our experiments, we use three PDB protein structure datasets. The first two datasets are composed of PDB structures from different conformations of two dimeric protein complexes: HIV-1 protease (329 structures) and triosephosphate isomerase (TIM) (86 structures). The third dataset is a collection of enzyme structures from the six top-level enzyme classes, namely Oxidoreductase, Transferase, Hydrolase, Lyase, Isomerase, and Ligase. We show that, for the first two datasets, our method captures the locking mechanism at the dimeric interface by taking into account the spatial positioning of the interfacial residues through graphs. Indeed, our frequent subgraph mining based approach discovers the patterns representing the dimerization lock, which is formed at the base of the structure, in 323 of the 329 HIV-1 protease structures. Similarly, for the 86 TIM structures, our approach discovers the dimerization lock formation in 50 structures. For the enzyme structures, we show that we are able to capture, through frequent subgraphs, the functional motifs (active sites) that are specific to each of the six top-level classes of enzymes.
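
To make the graph modeling concrete, the following is a minimal sketch of the general idea, not the algorithm developed in this work. It assumes that each residue is reduced to one labeled 3D point, that two residues are connected when they lie within a distance cutoff, and that a pattern is a single labeled edge; the function names `interface_graph` and `frequent_edge_patterns` and the 8 Å default cutoff are hypothetical choices made for illustration only.

```python
from collections import Counter
from itertools import combinations
from math import dist


def interface_graph(chain_a, chain_b, cutoff=8.0):
    """Build a labeled graph over the interface residues of a dimer.

    Each chain is a list of (residue_label, (x, y, z)) tuples, i.e. one
    representative point per residue. The cutoff value is an assumption.
    Returns (labels, edges): labels maps node id -> residue label, and
    edges is a set of (i, j) node-id pairs with i < j.
    """
    residues = [("A", lab, xyz) for lab, xyz in chain_a]
    residues += [("B", lab, xyz) for lab, xyz in chain_b]

    # A residue is interfacial if it lies within the cutoff of any residue
    # that belongs to the other chain.
    interface = [
        i for i, (chain, _, xyz) in enumerate(residues)
        if any(other != chain and dist(xyz, xyz2) <= cutoff
               for other, _, xyz2 in residues)
    ]

    labels = {i: residues[i][1] for i in interface}
    # Connect every pair of interface residues that are spatially close.
    edges = {
        (i, j) for i, j in combinations(interface, 2)
        if dist(residues[i][2], residues[j][2]) <= cutoff
    }
    return labels, edges


def frequent_edge_patterns(graphs, min_support):
    """Count single-edge patterns (unordered label pairs) across graphs.

    The support of a pattern is the number of graphs containing it at
    least once. This is only the first level of a frequent subgraph
    miner; larger patterns would be grown from these frequent edges.
    """
    support = Counter()
    for labels, edges in graphs:
        # Use a set so each pattern counts at most once per graph.
        patterns = {tuple(sorted((labels[i], labels[j]))) for i, j in edges}
        support.update(patterns)
    return {p: s for p, s in support.items() if s >= min_support}
```

Applied to a collection of structures of the same complex, patterns whose support is close to the number of structures would be the candidates to review as possible conserved interface motifs. A full miner would extend these frequent edges into larger subgraphs and would need a subgraph isomorphism test, which this sketch omits.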
