Sim4cc: a cross-species spliced alignment program

Advances in sequencing technologies have accelerated the sequencing of new genomes, far outpacing the generation of the gene and protein resources needed to annotate them. Directly aligning existing cDNA sequences from a related species is an effective and readily available way to identify genes in a new genome. Current spliced alignment programs are inadequate for comparing sequences between different species, owing to their low sensitivity and poor splice junction accuracy. A new spliced alignment tool, sim4cc, overcomes these problems by incorporating three new features: universal spaced seeds, which increase sensitivity and allow comparisons between species at various evolutionary distances, and powerful splice signal models and evolutionarily aware alignment techniques, which together improve the accuracy of gene models. When tested on vertebrate comparisons at diverse evolutionary distances, sim4cc was significantly more sensitive than existing alignment programs, by more than 10% over the closest competitor for some comparisons, while being comparable in speed to its predecessor, sim4. Sim4cc can be used in one-to-one or one-to-many comparisons of genomic and cDNA sequences, and can also be effectively incorporated into a high-throughput annotation engine, as demonstrated by the mapping of 64 000 Fagus grandifolia 454 ESTs and unigenes to the poplar genome.
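As a rough illustration of why spaced seeds help with diverged sequences, the sketch below finds seed hits between two short strings. It is not sim4cc's implementation: the function name `spaced_seed_hits`, the seed pattern `1101`, and the toy sequences are all assumptions chosen for the example. In the seed, `1` marks a position that must match and `0` a position that is ignored.

```python
def spaced_seed_hits(query, target, seed):
    """Return (query_pos, target_pos) pairs where every '1' position of the seed matches."""
    care = [i for i, c in enumerate(seed) if c == "1"]  # positions that must match
    span = len(seed)

    # Index every window of the target by the characters at the "care" positions.
    index = {}
    for t in range(len(target) - span + 1):
        key = "".join(target[t + i] for i in care)
        index.setdefault(key, []).append(t)

    # Look up each query window in the index.
    hits = []
    for q in range(len(query) - span + 1):
        key = "".join(query[q + i] for i in care)
        for t in index.get(key, []):
            hits.append((q, t))
    return hits


# Toy sequences (hypothetical) that differ at the third base.
query = "ACGTAC"
target = "ACTTAC"

print(spaced_seed_hits(query, target, "1101"))  # spaced seed, span 4, weight 3
print(spaced_seed_hits(query, target, "1111"))  # contiguous seed of span 4
```

With these inputs the spaced seed reports a hit at the start of both sequences because the mismatch falls on the ignored position, whereas the contiguous seed of the same span finds nothing.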
