An Improved System for Exon Recognition and Gene Modeling in Human DNA Sequence

A new version of the GRAIL system (Uberbacher and Mural, 1991; Mural et al., 1992; Uberbacher et al., 1993), called GRAIL II, has recently been developed (Xu et al., 1994). GRAIL II is a hybrid AI system that supports a number of DNA sequence analysis tools, including protein-coding region recognition, PolyA site and transcription promoter recognition, gene model construction, translation to protein, and DNA/protein database searching. This paper presents the core of GRAIL II: the coding exon recognition and gene model construction algorithms. The exon recognition algorithm recognizes coding exons by combining coding feature analysis with edge signal (acceptor, donor, and translation-start site) detection. Unlike the original GRAIL system (Uberbacher and Mural, 1991; Mural et al., 1992), this algorithm uses variable-length windows tailored to each potential exon candidate, making its performance almost independent of exon length. The recognition process is divided into four steps. First, a large number of possible coding exon candidates are generated. Then a rule-based prescreening algorithm eliminates the majority of the improbable candidates. As the kernel of the recognition algorithm, three neural networks are trained to evaluate the remaining candidates. The outputs of the neural networks are then divided into clusters of candidates, corresponding to presumed exons. The algorithm makes its final prediction by picking the best candidate from each cluster. The gene construction algorithm (Xu, Mural and Uberbacher, 1994) uses a dynamic programming approach to build gene models, taking as input the clusters predicted by the exon recognition algorithm. Extensive testing has been done on these two algorithms. (ABSTRACT TRUNCATED AT 250 WORDS)
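The dynamic programming step can be illustrated with a minimal sketch. It is not the GRAIL II algorithm itself: the abstract does not give the scoring function or the constraints used, so the example below assumes a hypothetical per-candidate score (standing in for a neural network output) and reduces the problem to choosing the highest-scoring set of non-overlapping exon candidates. Reading-frame compatibility and splice-site consistency, which a real gene model would need, are deliberately ignored. The names `Exon` and `best_gene_model` are illustrative only.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # first base of the candidate (inclusive)
    end: int      # last base of the candidate (inclusive)
    score: float  # hypothetical score, e.g. a neural network output


def best_gene_model(candidates: List[Exon]) -> Tuple[float, List[Exon]]:
    """Pick non-overlapping candidates with maximal total score.

    Simplifying assumption: only positional overlap is checked.
    """
    exons = sorted(candidates, key=lambda e: e.end)
    ends = [e.end for e in exons]
    n = len(exons)

    best = [0.0] * (n + 1)   # best[i]: best total using the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is used in best[i]
    prev = [0] * (n + 1)      # prev[i]: how many earlier exons end before exon i starts

    for i, exon in enumerate(exons, start=1):
        # Count earlier candidates that end strictly before this one starts.
        p = bisect_right(ends, exon.start - 1, 0, i - 1)
        prev[i] = p
        with_exon = best[p] + exon.score
        if with_exon > best[i - 1]:
            best[i] = with_exon
            take[i] = True
        else:
            best[i] = best[i - 1]

    # Trace back through the table to recover the chosen exons.
    chosen: List[Exon] = []
    i = n
    while i > 0:
        if take[i]:
            chosen.append(exons[i - 1])
            i = prev[i]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen


if __name__ == "__main__":
    # Two overlapping pairs of candidates, mimicking two clusters.
    demo = [
        Exon(100, 200, 0.9),
        Exon(150, 250, 0.6),
        Exon(300, 400, 0.8),
        Exon(380, 450, 0.5),
    ]
    total, model = best_gene_model(demo)
    for exon in model:
        print(exon.start, exon.end, exon.score)
```

In this toy input the overlapping candidates play the role of clusters, and the sketch keeps one candidate from each; a fuller version would add frame and splice-signal constraints to the recurrence.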