Dataset Construction for Gene Structure Prediction and Alternative Splicing Analysis

The performance of gene finding from genome sequences depends strongly on the accuracy of splice site prediction. Current gene-finding programs, however, have not yet reached sufficient accuracy. To improve splice site prediction, it is necessary to understand the splicing mechanism and to build a model from clear experimental evidence. For this purpose, genomic full-length precursor mRNA sequences (FL-pre-mRNAs), together with expression information, are indispensable. FL-pre-mRNAs contain the entire gene structure, including the 5’ and 3’ ends of the mRNA, the initiation codon, splice sites, the stop codon, and polyadenylation signals. They also contain all alternative splice sites, except those in the first or last exons of alternative transcripts. However, no database of FL-pre-mRNAs has been reported to date. Aligning expressed sequence tags (ESTs) to genomic sequences has been a common method for gene prediction and splice site analysis (1, 3). ESTs, however, are not suitable for collecting FL-pre-mRNAs: they are partial sequences, the 5’ ends of the mRNAs are unknown in most cases, and even EST contigs clustered in UniGene (2) or the RefSeq database (4) are not guaranteed to be full-length. In addition, ESTs are single sequencing reads that may contain mutations, insertions, or deletions (5). As genomic and EST sequence data have grown, computational approaches have become one of the methods for annotating sequences as putative genes or ORFs. In contrast, the GenBank database has accumulated entries in which complete genomic protein-coding sequences or full-length mRNA sequences are characterized by experimental evidence. These sequences and their annotation (the positions of gene boundaries and functional signals) are more reliable than those determined by in silico prediction and are therefore expected to be of high quality. Thus, we constructed datasets with experimental annotation from the GenBank database for gene structure prediction and splice site analysis.
In addition, an analysis of constitutive and alternative splice sites and their correlation with several biological descriptors will be discussed.
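As an illustration of how annotated exon boundaries yield splice sites for analysis, the following minimal sketch reads the donor and acceptor dinucleotides of each intron from a genomic sequence and a list of exon coordinates. It is not the procedure used to build the datasets described above. The function name `splice_sites`, the coordinate convention (1-based, inclusive, forward strand, as in a GenBank-style `join(...)` location), and the toy sequence are all assumptions made for this example.

```python
from typing import List, Tuple


def splice_sites(genomic: str, exons: List[Tuple[int, int]]) -> List[Tuple[str, str]]:
    """Return (donor, acceptor) dinucleotides for each intron.

    `exons` holds 1-based inclusive (start, end) pairs on the forward
    strand, sorted by position (a hypothetical convention for this sketch).
    """
    sites = []
    # Each pair of consecutive exons delimits one intron.
    for (_, upstream_end), (downstream_start, _) in zip(exons, exons[1:]):
        # The intron covers 1-based positions upstream_end+1 .. downstream_start-1.
        donor = genomic[upstream_end:upstream_end + 2]                # first two intron bases
        acceptor = genomic[downstream_start - 3:downstream_start - 1]  # last two intron bases
        sites.append((donor, acceptor))
    return sites


# Toy example: exon 1 = 1..6, intron = 7..18, exon 2 = 19..24.
toy_genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"
toy_exons = [(1, 6), (19, 24)]
print(splice_sites(toy_genomic, toy_exons))
```

For the toy input, the single intron begins with GT and ends with AG, the canonical donor and acceptor dinucleotides. Real entries would also require handling of reverse-strand genes and partial annotations, which this sketch omits.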