Sequence complexity profiles of prokaryotic genomic sequences: A fast algorithm for calculating linguistic complexity

MOTIVATION: One of the major features of genomic DNA sequences, distinguishing them from texts in most spoken or artificial languages, is their high repetitiveness. Variation in the repetitiveness of genomic texts reflects the presence and density of different biologically important messages. Thus, a deviation from the expected number of repeats in either direction indicates the possible presence of a biological signal. Linguistic complexity reflects the repetitiveness of a genomic text, and potential regulatory sites may be discovered by constructing typical patterns of complexity distribution.

RESULTS: We developed software for fast calculation of the linguistic complexity of DNA sequences. Our program uses suffix trees to compute the number of subwords present in genomic sequences, which allows linguistic complexity to be calculated in time linear in genome size. The measure was applied to the complete genome of Haemophilus influenzae. Maps of complexity along the entire genome were obtained using sliding windows of 40, 100, and 2000 nucleotides. This approach provided an efficient way to detect simple sequence repeats in this genome. In addition, local profiles of complexity distribution around the starts of translation were constructed for 21 complete prokaryotic genomes. We hypothesize that complexity profiles correspond to evolutionary relationships between organisms. We found principal differences between the profiles of GC-rich genomes and those of other (non-GC-rich) genomes. We also found characteristic differences among the profiles of AT-rich genomes, which probably reflect individual species variations in translational regulation.

AVAILABILITY: The program is available upon request from Alexander Bolshoy or at http://csweb.haifa.ac.il/library/#complex.
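
As an illustration of the sliding-window complexity maps described in the abstract, the following Python sketch computes a complexity profile for a short sequence. It rests on assumptions not stated in the abstract: linguistic complexity is taken here to be the ratio of the number of distinct subwords in a window to the maximum number possible for a string of that length over a four-letter alphabet, and subwords are counted by brute force with a set rather than with the suffix trees the authors use. The function names are hypothetical and are not part of the authors' program.

```python
def count_distinct_subwords(seq):
    """Count distinct non-empty substrings of seq.

    Brute-force stand-in (quadratic number of substrings); the paper's
    program uses suffix trees to obtain this count in linear time.
    """
    n = len(seq)
    return len({seq[i:j] for i in range(n) for j in range(i + 1, n + 1)})


def max_distinct_subwords(n, alphabet_size=4):
    """Upper bound on distinct substrings of a length-n string.

    For each word length k there are at most alphabet_size**k possible
    words and at most n - k + 1 positions where a word can start.
    """
    return sum(min(alphabet_size ** k, n - k + 1) for k in range(1, n + 1))


def linguistic_complexity(seq, alphabet_size=4):
    """Assumed definition: observed distinct subwords / maximum possible."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    observed = count_distinct_subwords(seq)
    return observed / max_distinct_subwords(len(seq), alphabet_size)


def complexity_profile(seq, window, step=1):
    """Complexity of each window of the given size along seq."""
    return [
        linguistic_complexity(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # A homopolymer run followed by a non-repetitive stretch.
    for start, value in enumerate(complexity_profile("AAAAACGT", window=4)):
        print(start, round(value, 2))
```

Under this assumed definition, a window containing a simple repeat such as a homopolymer run scores low, while a window with no repeated subwords scores 1.0. The brute-force counter is only practical for short windows; windows of thousands of nucleotides are where a suffix-tree-based count becomes necessary.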
