Splice site prediction with quadratic discriminant analysis using diversity measure.

Based on the conservation of nucleotides at splice sites and on the base composition and base correlation around these sites, we use the increment of diversity combined with quadratic discriminant analysis (IDQD) to study the dependence structure of splice sites and to predict exons, introns and their boundaries in four model genomes: Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster and human. The increment of diversity (ID) automatically integrates both the comparison of compositional features between two sequences and the comparison of base dependencies at adjacent or non-adjacent positions of the two sequences. Eight feature variables around a potential splice site are defined in terms of ID and combined in a single formal framework given by IDQD. In our calculations, regions of 7 (8) bases around the donor (acceptor) sites are used to study nucleotide conservation, and sequences of 48 bp on either side of the splice sites are used to study the compositional and base-correlation features. For human splice sites, the windows are enlarged to 16 bp (donor), 29 bp (acceptor) and 80 bp (either side) to improve prediction. The prediction capability of the present method is comparable with that of the leading splice site detector, GeneSplicer.
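The abstract does not spell out how ID is computed. As a minimal sketch, assuming the usual definition from the measure-of-diversity literature, D(X) = N log N - sum_i n_i log n_i with N = sum_i n_i, and ID(X, S) = D(X + S) - D(X) - D(S), the comparison of a candidate window with a reference source could look like the following. The function names and the use of overlapping k-mer counts as the state space are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def kmer_counts(seq, k=1):
    """Count overlapping k-mers in a sequence (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def diversity(counts):
    """Assumed definition: D(X) = N log N - sum_i n_i log n_i."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(
        n * math.log(n) for n in counts.values() if n > 0
    )


def increment_of_diversity(x, s):
    """Assumed definition: ID(X, S) = D(X + S) - D(X) - D(S).

    Counter addition sums the counts state by state, which gives
    the pooled source X + S.
    """
    return diversity(x + s) - diversity(x) - diversity(s)


# Two windows with the same base composition give an ID of (about) zero;
# windows with no shared states give a strictly positive ID.
same = increment_of_diversity(kmer_counts("ACAC"), kmer_counts("CACA"))
disjoint = increment_of_diversity(kmer_counts("AA"), kmer_counts("CC"))
```

Under this definition a smaller ID means the two count distributions are more alike, so a set of ID values computed against splice-site and non-splice-site reference sources could serve as feature variables for a quadratic discriminant. With k > 1 the same functions compare adjacent-base dependencies rather than single-base composition.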
