Mining Substructures in Protein Data

In this paper we consider the 'Prions' database, which describes protein instances stored for human prion proteins. The Prions database can be viewed as a database of rooted ordered labeled trees. Mining frequent substructures from tree databases is an important task that has attracted considerable interest in areas such as XML mining, bioinformatics, and Web mining. This has given rise to many tree mining algorithms, which can aid in structural comparison, association rule discovery, and, more generally, the mining of tree-structured knowledge representations. Previously we developed the MB3 tree mining algorithm, which, given a minimum support threshold, efficiently discovers all frequent embedded subtrees in a database of rooted ordered labeled trees. In this work we apply the algorithm to the Prions database in order to extract the frequently occurring patterns, which in this case are induced subtrees. The set of frequent induced subtrees obtained from the Prions database can potentially reveal useful knowledge. We demonstrate this by analysing the extracted frequent subtrees with respect to discovering interesting protein information. Furthermore, the minimum support threshold can be used as the controlling factor for answering specific queries posed on the Prions dataset. The approach is shown to be a viable technique for mining protein data.
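To make the notions of induced subtree and minimum support concrete, the following is a minimal sketch in Python. It is not the MB3 algorithm: it only checks whether a given pattern occurs as an induced ordered subtree and counts its support by brute force. The tree encoding, the node labels (`Protein`, `Chain`, `Residue`, and so on), the toy database, and the transaction-based definition of support (number of database trees containing the pattern at least once) are all assumptions made for illustration and are not taken from the Prions database.

```python
# Illustrative sketch only: brute-force support counting for induced
# ordered subtrees. All labels and data below are hypothetical.

# A tree is a pair (label, children), where children is a list of trees.


def matches_at(pattern, node):
    """True if `pattern` occurs as an induced ordered subtree rooted at `node`.

    Induced: every parent-child edge of the pattern maps to a parent-child
    edge of the tree. Ordered: the pattern's children map to the node's
    children in the same left-to-right order (siblings may be skipped).
    """
    p_label, p_children = pattern
    n_label, n_children = node
    if p_label != n_label:
        return False
    # Greedily match pattern children to a subsequence of node children.
    i = 0
    for child in n_children:
        if i < len(p_children) and matches_at(p_children[i], child):
            i += 1
    return i == len(p_children)


def occurs_in(pattern, tree):
    """True if `pattern` matches at the root of `tree` or at any descendant."""
    if matches_at(pattern, tree):
        return True
    return any(occurs_in(pattern, child) for child in tree[1])


def support(pattern, database):
    """Transaction-based support: number of trees that contain the pattern."""
    return sum(1 for tree in database if occurs_in(pattern, tree))


def is_frequent(pattern, database, min_support):
    """A pattern is frequent if its support reaches the threshold."""
    return support(pattern, database) >= min_support


# Hypothetical toy database of three rooted ordered labeled trees.
DATABASE = [
    ("Protein", [("Chain", [("Residue", []), ("Atom", [])]), ("Structure", [])]),
    ("Protein", [("Chain", [("Residue", [])])]),
    ("Protein", [("Structure", [])]),
]

# Induced pattern: Residue is a direct child of Chain.
INDUCED_PATTERN = ("Chain", [("Residue", [])])

# Embedded-only pattern: Residue is a grandchild of Protein, so this
# ancestor-descendant relation does not count as an induced occurrence.
EMBEDDED_ONLY_PATTERN = ("Protein", [("Residue", [])])

if __name__ == "__main__":
    print(support(INDUCED_PATTERN, DATABASE))
    print(support(EMBEDDED_ONLY_PATTERN, DATABASE))
    print(is_frequent(INDUCED_PATTERN, DATABASE, min_support=2))
```

In this sketch, raising or lowering `min_support` changes which patterns are reported as frequent, which is the sense in which the threshold can act as a controlling factor for queries. An embedded-subtree miner would additionally accept ancestor-descendant matches such as `EMBEDDED_ONLY_PATTERN`, which the induced check here rejects.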
