Efficient discovery of common substructures in macromolecules

Biological macromolecules play a fundamental role in disease and are therefore of great interest to fields such as pharmacology and chemical genomics. Yet because of their complexity, effective techniques for elucidating macromolecular structure-function relationships remain underexplored. Previous techniques have focused either on sequence analysis, which only approximates structure-function relationships, or on small coordinate datasets, an approach that neither scales to large datasets nor handles noise. We present a novel, scalable approach that efficiently discovers macromolecule substructures from three-dimensional coordinate data without domain-specific knowledge. The approach combines structure-based frequent pattern discovery with search space reduction and coordinate noise handling. We analyze its computational performance against traditional approaches, validate that it can discover meaningful substructures in noisy macromolecule data by automatically discovering primary and secondary protein structures, and show that it is superior to sequence-based approaches at determining structural, and thus functional, similarity between proteins.
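To make the three ingredients named above more concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes each molecule is a list of `(label, x, y, z)` tuples, treats a labeled atom pair at a binned distance as the simplest possible "substructure", tolerates coordinate noise by rounding distances to a bin, and reduces the search space by skipping pairs farther apart than a cutoff. The function name `frequent_pairs` and the parameters `bin_width`, `max_range`, and `min_support` are hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations
from math import dist


def frequent_pairs(molecules, bin_width=0.5, max_range=4.0, min_support=2):
    """Return labeled atom-pair patterns found in at least min_support molecules.

    All names and defaults here are illustrative assumptions.
    """
    support = Counter()
    for atoms in molecules:
        seen = set()  # count each pattern at most once per molecule
        for (la, *pa), (lb, *pb) in combinations(atoms, 2):
            d = dist(pa, pb)
            if d > max_range:
                continue  # search space reduction: ignore distant pairs
            a, b = sorted((la, lb))  # make the pattern order-independent
            # noise handling: nearby distances fall into the same bin
            seen.add((a, b, round(d / bin_width)))
        support.update(seen)
    return {p: n for p, n in support.items() if n >= min_support}
```

Larger patterns could in principle be grown level by level from the frequent pairs, in the style of frequent pattern mining, but that extension is beyond this sketch. Note also that fixed-width binning can separate two distances that lie just on either side of a bin boundary, so it is only a crude stand-in for real noise handling.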
