A Combinatorial Approach to Automatic Discovery of Cluster-Patterns

Functionally related genes often appear in each other's neighborhood on the genome, although the order of the genes may not be the same. These groups, or clusters, of genes may have an ancient evolutionary origin or may signify some other critical phenomenon, and they may also aid in predicting gene function. Such gene clusters also help in solving the problem of local alignment of genes. Similarly, clusters of protein domains, albeit appearing in different orders in the protein sequence, suggest common functionality in spite of being nonhomologous. In this paper we address the problem of automatically discovering clusters of entities, be they genes or domains: we formalize the abstract problem as a discovery problem called the πpattern problem and give an algorithm that automatically discovers the clusters of patterns in multiple data sequences. We take a model-less approach and introduce a notation for maximal patterns that drastically reduces the number of valid cluster patterns without any loss of information. We demonstrate the automatic pattern discovery tool on motifs in E. coli protein sequences.
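To make the problem statement concrete, the following is a minimal brute-force sketch of what discovering a cluster "in any order" across multiple sequences can mean. It is not the algorithm of the paper and does not implement the maximal-pattern notation; the function name `permutation_patterns`, the fixed window `size`, the `quorum` parameter, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def permutation_patterns(sequences, size, quorum):
    """Return groups of `size` adjacent items that occur, in any order,
    in at least `quorum` distinct sequences (hypothetical helper)."""
    # Maps the sorted contents of a window to the indices of the
    # sequences in which that window occurs.
    seen = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - size + 1):
            # Sorting removes the order, so "a b c" and "c a b" collide.
            key = tuple(sorted(seq[start:start + size]))
            seen[key].add(idx)
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) >= quorum}


# Toy data: each list stands in for a sequence of gene or domain labels.
genomes = [
    ["a", "b", "c", "x", "y"],
    ["y", "c", "a", "b", "z"],
    ["b", "a", "c", "q"],
]
result = permutation_patterns(genomes, size=3, quorum=3)
print(result)
```

In this toy input the labels a, b, and c are adjacent in all three sequences but in three different orders, so they are reported as one cluster. A practical method would need to handle all cluster sizes at once and avoid reporting redundant sub-clusters, which is the role the abstract assigns to maximal patterns.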
