Pentamer vocabularies characterizing introns and intron-like intergenic tracts from Caenorhabditis elegans and Drosophila melanogaster.

Genomes of different species are characterized by overall compositional properties at the level of bases, dinucleotides and longer oligonucleotides. In Caenorhabditis elegans, using recurrence analysis, we had recognized a long-range correlation in the oligonucleotide usage of introns and intergenic regions. Correlation analysis confirms here that this is a genome-wide property of the non-coding portions of the C. elegans genome. We then investigate whether a typical vocabulary can be extracted by statistical analysis of experimentally confirmed introns of sufficient length (>1 kb), stripped of known splice signals, with the focus on distributed lexical features rather than on localized motifs. Principal component analysis of pentanucleotide frequency distributions exposed lexical preferences typical of introns, both in C. elegans and in Drosophila melanogaster. In each species, the introns' pentamer preferences are largely shared by intergenic tracts. The pentamer vocabularies extracted for the two species exhibit interesting symmetry properties and overlap in part. A more extensive investigation of the interspecies relationship at the level of oligonucleotide preferences in non-coding regions that are not related by sequence similarity might form the basis of new approaches to studying the evolutionary behaviour of these regions.
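
As an illustration of the kind of analysis described above, the following is a minimal Python sketch of computing pentanucleotide frequency distributions and applying principal component analysis to them. It is not the authors' pipeline: the function names (`pentamer_frequencies`, `pca`, `top_words`), the use of overlapping windows, the normalization to relative frequencies, and the SVD-based PCA on mean-centered data are all assumptions made for the example, and the sequences in the usage section are toy strings, not real introns.

```python
from itertools import product

import numpy as np

K = 5  # word length: pentamers
PENTAMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 4**5 = 1024 words
INDEX = {word: i for i, word in enumerate(PENTAMERS)}


def pentamer_frequencies(seq):
    """Relative frequencies of overlapping pentamers in one sequence.

    Windows containing characters other than A, C, G, T are skipped
    (an assumption of this sketch).
    """
    seq = seq.upper()
    counts = np.zeros(len(PENTAMERS))
    for i in range(len(seq) - K + 1):
        j = INDEX.get(seq[i:i + K])
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def pca(freq_matrix, n_components=2):
    """PCA via SVD of the column-centered matrix (rows = sequences).

    Returns (scores, loadings): scores has one row per sequence,
    loadings has one row per component and one column per pentamer.
    """
    centered = freq_matrix - freq_matrix.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    loadings = vt[:n_components]
    return scores, loadings


def top_words(loading, n=5):
    """Pentamers with the largest positive weights on one component."""
    order = np.argsort(loading)[::-1][:n]
    return [PENTAMERS[i] for i in order]


if __name__ == "__main__":
    # Toy sequences standing in for intron and coding tracts (hypothetical data).
    toy = [
        "ATTTTTAAATTTTCAAAATTTTTGAAAATTTTTAAATT" * 5,
        "TTTTTCAGAAAATTTTAAAATTGAAATTTTTTCAAAAA" * 5,
        "GCCGAGGACCTGGCCAAGGGCATCGACGTGGCCGAGCT" * 5,
        "CTGGAGGCCAAGGACGGCACCGTCGAGCTGGCCAAGGA" * 5,
    ]
    matrix = np.array([pentamer_frequencies(s) for s in toy])
    scores, loadings = pca(matrix, n_components=2)
    print(scores[:, 0])            # position of each sequence on the first component
    print(top_words(loadings[0]))  # pentamers weighting most on that component
```

In a sketch like this, the pentamers with the largest absolute loadings on a component that separates introns from other tracts would be the candidates for a "vocabulary"; how the actual vocabularies were selected is not specified in this abstract.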
