MOTIVATION
The process of determining the functional sequence content of an organism is confounded by several factors. Large protein-coding sequences are relatively easy to find by statistical methods. Smaller proteins, however, may escape detection because their size falls below an arbitrary researcher-defined minimum cutoff, or because a promoter or translational start cannot be precisely defined (Delcher et al., Nucleic Acids Res., 27, 4636-4641, 1999). Promoter and regulatory sequences are themselves difficult to define because of the significant amount of allowable sequence variation and the probable lack, to date, of any completely accurate whole-organism gene catalog. Finally, certain genes coding for functional RNAs may have insufficient structural or sequence constraints to be detectable by normal sequence structure/pattern searching methods (Eddy and Rivas, Bioinformatics, 16, 583-605, 2000). Where multiple closely related organisms have been sequenced, additional information is available for investigating sequence content: functional sequences are likely to be conserved between the organisms. We present a method that uses this conservation to detect genes and other potentially functional sequences that may be missed by standard ORF-calling, RNA-finding, and pattern-matching software. The tricross programs produce a multi-way cross comparison of three sets of sequences, determine which sequences are conserved in all three sets, and produce a graphical representation (Virtual Reality Modelling Language, VRML; ISO/IEC 14772-1:1997, VDC, 1997) as well as alignments of all sequence triples found. The software can also be applied to a pair of sequence sets, though the noise in the results increases.
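As an illustration only, the three-way conservation test described above can be sketched as follows. This is not the tricross implementation: the k-mer Jaccard similarity, the function names, and the `threshold` parameter are all assumptions standing in for whatever sequence comparison and stringency settings tricross actually uses. The sketch keeps a triple only when all three pairwise comparisons pass.

```python
from itertools import product

def kmers(seq, k=4):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similar(a, b, k=4, threshold=0.5):
    """Hypothetical stand-in similarity test: Jaccard index of k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return False
    return len(ka & kb) / len(ka | kb) >= threshold

def conserved_triples(set_a, set_b, set_c, **kw):
    """Return (name_a, name_b, name_c) triples in which every pair is similar.

    Each argument is a dict mapping a sequence name to its sequence string.
    """
    triples = []
    for (na, sa), (nb, sb) in product(set_a.items(), set_b.items()):
        if not similar(sa, sb, **kw):
            continue  # no A-B match, so no triple can include this pair
        for nc, sc in set_c.items():
            if similar(sa, sc, **kw) and similar(sb, sc, **kw):
                triples.append((na, nb, nc))
    return triples
```

Dropping the third set reduces the test to a single pairwise comparison, which is consistent with the observation that two-set results are noisier.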
RESULTS
Tricross has been used to examine the intergenic-sequence content of the three archaeal Pyrococcus genomes to determine the most highly related sequences remaining between the annotated protein- and RNA-coding sequences. With relatively stringent similarity requirements set for the search, tricross found 101 intergenic sequences conserved among the three organisms. Interestingly, 29 of these appear to contain members of a family of small RNA molecules (Kiss-Laszlo et al., EMBO J., 17, 797-807, 1998) only recently discovered in the Archaea (Armbruster, OSU, Diss., 1988; Omer et al., Science, 288, 517-522, 2000; Gaspin et al., J. Mol. Biol., 297, 895-906, 2000). While some of the remaining 72 appear to be individual highly conserved promoter sequences, others have no currently known biological significance. Although tricross was originally developed to facilitate the examination of intergenic sequences, none of its logic is inherently specific to them. The software can also be applied to gene sequences, and has been used to produce inter-genomic gene order dot-plots for Haemophilus influenzae (Fleischmann et al., Science, 269, 496-512, 1995) versus H. ducreyi (unpublished data), and for Neisseria meningitidis Z2491 (serogroup A) (Parkhill et al., Nature, 404, 502-506, 2000) versus Neisseria meningitidis MC58 (serogroup B) (Tettelin et al., Science, 287, 1809-1815, 2000) versus Neisseria gonorrhoeae (Lewis et al., http://micro-gen.ouhsc.edu/, 2000).
AVAILABILITY
The tricross software package is available from http://www.biosci.ohio-state.edu/~ray/bioinformatics/tricross.html.
CONTACT
ray@biosci.ohio-state.edu; daniels.7@osu.edu; munsonr@pediatrics.ohio-state.edu
SUPPLEMENTARY INFORMATION
Additional data from the cross-genomic comparisons examined in the discussion section are linked from http://www.biosci.ohio-state.edu/~ray/bioinformatics/tricross.html.
[1] B. Barrell, et al. Complete DNA sequence of a serogroup A strain of Neisseria meningitidis Z2491. Nature, 2000.
[2] Z. Kiss-László, et al. Sequence and structural elements of methylation guide snoRNAs essential for site-specific ribose methylation of pre-rRNA. The EMBO Journal, 1998.
[3] R. Fleischmann, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 1995.
[4] J. Bachellerie, et al. Archaeal homologs of eukaryotic methylation guide small nucleolar RNAs: lessons from the Pyrococcus genomes. Journal of Molecular Biology, 2000.
[5] F. Robb, et al. Complete sequence and gene organization of the genome of a hyper-thermophilic archaebacterium, Pyrococcus horikoshii OT3. DNA Research, 1998.
[6] S. Salzberg, et al. Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science, 2000.
[7] S. Eddy, et al. Homologs of small nucleolar RNAs in Archaea. Science, 2000.
[8] J. Thompson, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 1994.
[9] Elena Rivas, et al. Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics, 2000.
[10] D. Lipman, et al. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America, 1988.
[11] J. Cannon, et al. The physical map of the chromosome of a serogroup A strain of Neisseria meningitidis shows complex rearrangements relative to the chromosomes of the two mapped strains of the closely related species N. gonorrhoeae. Journal of Bacteriology, 1995.
[12] S. Salzberg, et al. Improved microbial gene identification with GLIMMER. Nucleic Acids Research, 1999.
[13] William C. Ray, et al. The PACRAT system: an extensible WWW-based system for correlated sequence retrieval, storage and analysis. Bioinformatics, 2001.