Syntenic global alignment and its application to the gene prediction problem

As the number of available genomic sequences grows, so does the need to identify their protein-coding regions. The gene prediction problem can be addressed in several ways, and one of the most promising uses information derived from comparing homologous sequences. In this work, we develop a new comparison-based gene prediction program called Exon_Finder2. The tool is based on a new type of alignment that we propose, called syntenic global alignment, which is designed for sequences whose shared regions are conserved at different rates. In addition to defining this alignment, we describe a dynamic programming algorithm that computes a best syntenic global alignment of two sequences together with its score. The promising initial results obtained with Exon_Finder2 support the applicability of the approach: on a benchmark of 120 pairs of human and mouse genomic sequences, the program successfully identified most of the encoded genes.
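
For orientation, the sketch below shows the classical dynamic programming recurrence for the score of an ordinary global alignment (Needleman-Wunsch with a linear gap penalty). It is not the syntenic global alignment algorithm described above, whose recurrence is not given in this abstract. The function name and the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions only.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Score of a best ordinary global alignment of s and t.

    Baseline sketch only: a single scoring scheme is applied to the
    whole sequence pair, unlike an alignment that distinguishes regions
    with different conservation rates. Scoring values are assumptions.
    """
    # prev[j] holds the best score for aligning s[:i-1] with t[:j].
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        # Aligning s[:i] with the empty prefix of t costs i gaps.
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # align s[i-1] with t[j-1]
                prev[j] + gap,      # s[i-1] against a gap
                curr[j - 1] + gap,  # t[j-1] against a gap
            )
        prev = curr
    return prev[len(t)]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the table are kept, so the sketch uses memory linear in the length of `t`; recovering the alignment itself, rather than just its score, would need the full table or a divide-and-conquer scheme.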
