GMAP: a genomic mapping and alignment program for mRNA and EST sequences

MOTIVATION: We introduce GMAP, a standalone program for mapping and aligning cDNA sequences to a genome. The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets. It generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models. The methodology underlying the program includes a minimal sampling strategy for genomic mapping, oligomer chaining for approximate alignment, sandwich dynamic programming (DP) for splice site detection, and microexon identification with statistical significance testing.

RESULTS: On a set of human messenger RNAs with random mutations introduced at rates of 1% and 3%, GMAP identified all splice sites accurately in over 99.3% of the sequences, an error rate one-tenth that of existing programs. On a large set of human expressed sequence tags, GMAP provided higher-quality alignments more often than blat did. On a set of Arabidopsis cDNAs, GMAP performed comparably with GeneSeqer. In these experiments, GMAP demonstrated a several-fold increase in speed over existing programs.

AVAILABILITY: Source code for gmap and associated programs is available at http://www.gene.com/share/gmap

SUPPLEMENTARY INFORMATION: http://www.gene.com/share/gmap
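The abstract names oligomer chaining as the approximate-alignment step but does not describe it. As a rough illustration of the general idea only, and not GMAP's actual implementation, the sketch below finds exact k-mer matches between a cDNA and a genomic sequence and then keeps the longest chain of matches that increases in both coordinates. The function names `kmer_hits` and `chain_hits`, the choice of k, the simple quadratic-time chaining, and the toy sequences are all assumptions made for this example.

```python
from collections import defaultdict


def kmer_hits(query, genome, k=8):
    """Return (query_pos, genome_pos) pairs where a k-mer matches exactly."""
    index = defaultdict(list)
    for g in range(len(genome) - k + 1):
        index[genome[g:g + k]].append(g)
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], ()):
            hits.append((q, g))
    return hits


def chain_hits(hits):
    """Return the longest chain of hits strictly increasing in both coordinates.

    Uses a simple O(n^2) dynamic program; this is an illustrative choice,
    not the method described in the paper.
    """
    hits = sorted(hits)
    if not hits:
        return []
    best = [1] * len(hits)   # best[i]: length of the best chain ending at hit i
    prev = [-1] * len(hits)  # prev[i]: predecessor of hit i in that chain
    for i, (qi, gi) in enumerate(hits):
        for j in range(i):
            qj, gj = hits[j]
            if qj < qi and gj < gi and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    i = max(range(len(hits)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(hits[i])
        i = prev[i]
    return chain[::-1]


# Toy example (made-up sequences): two "exons" separated by an "intron".
exon1, intron, exon2 = "ACGTTGCA", "GTAAGTCCCCCCAG", "TTAGGCAT"
genome = exon1 + intron + exon2
query = exon1 + exon2  # spliced cDNA

chain = chain_hits(kmer_hits(query, genome, k=4))
print(chain)
```

In this toy case the chain runs along the first exon, then jumps from genomic position 4 to 22. A jump of that kind in the genomic coordinate, with no comparable jump in the cDNA coordinate, is what marks a candidate intron for a later, more precise splice site search.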
