Structural data mining on secondary genomic structure

Many methods classify genomic data by aligning, comparing, and analyzing primary nucleotide sequences, using algorithms such as decision trees and hidden Markov models (HMMs). These methods are not always effective, however, because motifs are more conserved in structure than in sequence. We therefore propose to classify from structure rather than from primary sequence, exploiting the fact that a primary nucleotide sequence can fold back onto itself to form a secondary structure. The proposed algorithm, called the random multi-level attributed (RMLA) graph algorithm, performs data mining on structural data and is used to mine and represent the secondary genomic structure of biomolecules such as tRNA. Structural similarity is identified using an information measure, which also characterizes the resulting classes. Experiments are based on known tRNA structural data. The results show that our approach effectively classifies different classes of tRNA secondary structure. We also compare our results with those of other classification algorithms to assess its effectiveness, and the comparison shows that our approach classifies structural data better. RMLA graphs are not suited only to the classification of genomic data: wherever graphs are used to model data, they are useful for discovering patterns in databases.
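To make the idea of treating a folded sequence as structural data concrete, the following is a minimal sketch of how a secondary structure could be encoded as an attributed graph. It is not the RMLA algorithm, whose details the abstract does not give. It assumes the structure is supplied in dot-bracket notation (an input format chosen here for illustration), and the names `structure_to_graph` and `pair_type_counts` are hypothetical.

```python
from collections import Counter


def structure_to_graph(sequence, dot_bracket):
    """Encode a folded sequence as an attributed graph (illustrative only).

    Nodes are positions labelled with their nucleotide. Edges are labelled
    "backbone" (adjacent positions) or "pair" (base pairs from the fold).
    Dot-bracket input is an assumption of this sketch.
    """
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")

    # Node attribute: the nucleotide at each position.
    nodes = {i: base for i, base in enumerate(sequence)}

    # Backbone edges connect consecutive positions of the primary sequence.
    edges = [(i, i + 1, "backbone") for i in range(len(sequence) - 1)]

    # Base-pair edges: match each ")" with the most recent unmatched "(".
    stack = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unexpected ')'")
            edges.append((stack.pop(), i, "pair"))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol: {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: unclosed '('")

    return nodes, edges


def pair_type_counts(nodes, edges):
    """Count base pairs by nucleotide type, a simple structural attribute."""
    return Counter(
        f"{nodes[a]}-{nodes[b]}" for a, b, kind in edges if kind == "pair"
    )


# Toy hairpin, not real tRNA data.
nodes, edges = structure_to_graph("GGGAAACCC", "(((...)))")
counts = pair_type_counts(nodes, edges)
```

Attributes such as the pair-type counts could then feed a graph comparison or classification step; how RMLA actually defines its levels and attributes is left to the full paper.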
