Discovering knowledge in DNA and protein data

This research investigates a method for discovering knowledge in structural data. We have implemented the SUBDUE substructure discovery system, which uses the minimum description length principle to discover interesting, repetitive subgraphs in a labeled graph representation. Experiments have shown SUBDUE's applicability in a variety of domains. We are currently applying SUBDUE to DNA and protein data from the Brookhaven Protein Data Bank (PDB), where it has found patterns in secondary structure that are both characteristic of and unique to categories of proteins such as hemoglobin and myoglobin. Ultimately, we plan to use SUBDUE to find structural patterns in functional groups of proteins and to locate the boundaries of genes in DNA.
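
The minimum description length idea can be illustrated with a small sketch. It is not SUBDUE's implementation: the bit-count encoding, the restriction to one-edge patterns, the greedy non-overlapping matching, and every name below (description_length, find_instances, compress, compression_value, the "SUB" label, the toy chain of secondary-structure elements) are assumptions made for illustration only. A candidate substructure S is scored by how much the graph G shrinks once each instance of S is replaced by a single vertex, here taken as DL(G) / (DL(S) + DL(G|S)), so a value above 1 means the substructure compresses the graph.

```python
import math


def description_length(vertices, edges):
    """Crude bit count for a labeled graph (illustrative, not SUBDUE's encoding)."""
    n = len(vertices)
    if n == 0:
        return 0.0
    labels = set(vertices.values()) | {lab for _, _, lab in edges}
    label_bits = math.log2(len(labels)) if len(labels) > 1 else 1.0
    index_bits = math.log2(n) if n > 1 else 1.0
    vertex_bits = n * label_bits
    # Each edge stores two endpoint indices plus one label.
    edge_bits = len(edges) * (2 * index_bits + label_bits)
    return vertex_bits + edge_bits


def find_instances(vertices, edges, pattern):
    """Greedily collect non-overlapping matches of a one-edge pattern."""
    src_label, edge_label, dst_label = pattern
    used, instances = set(), []
    for u, v, lab in edges:
        if (lab == edge_label and u != v
                and vertices[u] == src_label and vertices[v] == dst_label
                and u not in used and v not in used):
            instances.append((u, v))
            used.update((u, v))
    return instances


def compress(vertices, edges, pattern, instances, new_label="SUB"):
    """Replace every instance with a single vertex labeled new_label."""
    _, edge_label, _ = pattern
    merged = {}
    for i, (u, v) in enumerate(instances):
        merged[u] = merged[v] = ("sub", i)
    new_vertices = {merged.get(vid, vid): (new_label if vid in merged else lab)
                    for vid, lab in vertices.items()}
    instance_pairs = set(instances)
    # Drop the pattern edge inside each instance; reroute all other edges.
    new_edges = [(merged.get(u, u), merged.get(v, v), lab)
                 for u, v, lab in edges
                 if not ((u, v) in instance_pairs and lab == edge_label)]
    return new_vertices, new_edges


def compression_value(vertices, edges, pattern):
    """Score a pattern as DL(G) / (DL(S) + DL(G|S)); higher is better."""
    src_label, edge_label, dst_label = pattern
    instances = find_instances(vertices, edges, pattern)
    dl_g = description_length(vertices, edges)
    dl_s = description_length({0: src_label, 1: dst_label}, [(0, 1, edge_label)])
    dl_g_given_s = description_length(*compress(vertices, edges, pattern, instances))
    return dl_g / (dl_s + dl_g_given_s)


# Hypothetical toy input: a chain of secondary-structure elements.
toy_labels = ["helix", "sheet", "helix", "sheet", "helix", "sheet", "turn", "helix"]
toy_vertices = dict(enumerate(toy_labels))
toy_edges = [(i, i + 1, "next") for i in range(len(toy_labels) - 1)]

frequent = ("helix", "next", "sheet")  # occurs three times in the chain
rare = ("sheet", "next", "turn")       # occurs once

frequent_value = compression_value(toy_vertices, toy_edges, frequent)
rare_value = compression_value(toy_vertices, toy_edges, rare)
print(f"frequent pattern value: {frequent_value:.3f}")
print(f"rare pattern value:     {rare_value:.3f}")
```

On this toy chain the repeated helix-to-sheet pattern scores higher than the single sheet-to-turn pattern, which is the behavior the principle is meant to reward. A real system would also grow candidate substructures beyond a single edge and search over them, which this sketch omits.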