Optimal spliced alignment of homologous cDNA to a genomic DNA template

MOTIVATION: Supplementary cDNA or EST evidence is often decisive for discriminating between alternative gene predictions derived from computational sequence inspection by any of a number of requisite programs. Without additional experimental effort, this approach must rely on the occurrence of cognate ESTs for the gene under consideration in the available, generally incomplete, EST collections for the given species. In some cases, particular exon assignments can be supported by sequence matching even if the cDNA or EST derives from non-cognate genomic DNA, such as a different locus of a gene family or a homologous locus from a different species. However, marginally significant sequence matching alone can also be misleading. We sought to develop an algorithm that simultaneously scores predicted intrinsic splice site strength and sequence matching between the genomic DNA template and a related cDNA or EST. Under such combined scoring, weakly predicted splice sites may be chosen for the optimal scoring spliced alignment on the basis of surrounding sequence matching, while strongly predicted splice sites will enter the optimal spliced alignment even without strong sequence matching.

RESULTS: We designed a novel algorithm that produces the optimal spliced alignment of a genomic DNA with a cDNA or EST based on scoring for both sequence matching and intrinsic splice site strength. By example, we demonstrate that this combined approach appears to improve gene prediction accuracy compared with current methods that rely either on search by content and signal alone or on sequence similarity alone.

AVAILABILITY: The algorithm is available as a C subroutine and is implemented in the SplicePredictor and GeneSeqer programs. The source code is available via anonymous ftp from ftp.zmdb.iastate.edu. Both programs are also implemented as a Web service at http://gremlin1.zool.iastate.edu/cgi-bin/sp.cgi and http://gremlin1.zool.iastate.edu/cgi-bin/gs.cgi, respectively.

CONTACT: vbrendel@iastate.edu
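The idea of scoring sequence matching and splice site strength in one alignment can be illustrated with a small dynamic program. The sketch below is not the authors' C subroutine: the function names (`toy_site_scores`, `spliced_align`), the match, mismatch and gap values, and the GT/AG dinucleotide rule that stands in for a real splice site model are all assumptions made for illustration. It keeps an exon state and an intron state, charges a donor score when an intron opens and an acceptor score when it closes, and enforces no minimum intron length.

```python
NEG = float("-inf")

def toy_site_scores(genomic, site=0.0, other=-8.0):
    """Hypothetical stand-in for a splice site model: canonical GT/AG
    dinucleotides score `site`, every other position scores `other`."""
    n = len(genomic)
    donor = [other] * (n + 1)     # donor[i]: intron starts at 1-based position i
    acceptor = [other] * (n + 1)  # acceptor[i]: intron ends at 1-based position i
    for i in range(1, n):
        pair = genomic[i - 1:i + 1]          # 1-based positions i and i+1
        if pair == "GT":
            donor[i] = site
        if pair == "AG":
            acceptor[i + 1] = site
    return donor, acceptor

def spliced_align(genomic, cdna, donor, acceptor,
                  match=1.0, mismatch=-1.0, gap=-2.0):
    """Return (score, introns); introns are 1-based inclusive genomic spans.
    Genomic sequence flanking the aligned cDNA is not penalized."""
    n, m = len(genomic), len(cdna)
    E = [[NEG] * (m + 1) for _ in range(n + 1)]   # genomic position i in an exon
    I = [[NEG] * (m + 1) for _ in range(n + 1)]   # genomic position i in an intron
    back_e = [[""] * (m + 1) for _ in range(n + 1)]
    back_i = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        E[i][0] = 0.0                             # free leading genomic flank
    for j in range(1, m + 1):
        E[0][j], back_e[0][j] = j * gap, "C"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Intron state: open at a donor site, or extend an open intron.
            I[i][j], back_i[i][j] = max(
                (E[i - 1][j] + donor[i], "open"),
                (I[i - 1][j], "extend"))
            s = match if genomic[i - 1] == cdna[j - 1] else mismatch
            E[i][j], back_e[i][j] = max(
                (E[i - 1][j - 1] + s, "M"),                         # aligned pair
                (I[i - 1][j - 1] + acceptor[i - 1] + s, "A"),       # intron closes at i-1
                (E[i - 1][j] + gap, "D"),                           # genomic base unaligned
                (E[i][j - 1] + gap, "C"))                           # cDNA base unaligned
    # Free trailing genomic flank: best exon-state cell in the last cDNA column.
    i = max(range(n + 1), key=lambda k: E[k][m])
    score, j, state, introns, intron_end = E[i][m], m, "E", [], None
    while i > 0 and j > 0:
        if state == "E":
            move = back_e[i][j]
            if move == "M":
                i, j = i - 1, j - 1
            elif move == "A":
                intron_end, i, j, state = i - 1, i - 1, j - 1, "I"
            elif move == "D":
                i -= 1
            else:
                j -= 1
        else:
            if back_i[i][j] == "open":
                introns.append((i, intron_end))
                state = "E"
            i -= 1
    introns.reverse()
    return score, introns

# Toy example: two exons separated by a GT...AG intron, with short flanks.
exon1, intron, exon2 = "ATGGCCAAC", "GTAAGTTTTCAG", "CTTGACTGA"
genomic = "CCC" + exon1 + intron + exon2 + "CC"
donor, acceptor = toy_site_scores(genomic)
print(spliced_align(genomic, exon1 + exon2, donor, acceptor))  # (18.0, [(13, 24)])
```

In this toy setting the per-position `donor` and `acceptor` lists are the only place where splice site strength enters, so a graded site model could be substituted without changing the recurrence. Raising the score of a site makes it easier for an intron to be placed there even when the flanking sequence matches poorly, and lowering it requires more surrounding sequence matching to justify the same intron, which is the trade-off the abstract describes.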
