Gene recognition based on DAG shortest paths

We describe DAGGER, an ab initio gene recognition program that combines the output of high-dimensional signal sensors in an intuitive gene model based on directed acyclic graphs. In the first stage, candidate start, donor, acceptor, and stop sites are scored using the SNoW learning architecture. These sites are then used to generate a directed acyclic graph in which each source-sink path represents a possible gene structure. Training sequences are used to optimize an edge weighting function so that the shortest source-sink path maximizes exon-level prediction accuracy. Experimental evaluation of prediction accuracy on two benchmark data sets demonstrates that DAGGER is competitive with ab initio gene finding programs based on Hidden Markov Models.
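The graph stage can be illustrated with a minimal sketch of shortest-path search in a directed acyclic graph. This is not DAGGER's implementation: the node names, the toy site positions, and the edge weights below are all hypothetical, and the learned edge weighting function is replaced by hand-picked constants. The sketch assumes only what the abstract states, namely that candidate sites become nodes, that each source-sink path encodes one gene structure, and that the lowest-weight path is taken as the prediction.

```python
import math


def dag_shortest_path(nodes, edges, source, sink):
    """Return (total_weight, path) for the shortest source-sink path.

    nodes: list of node names in topological order (here, sequence order).
    edges: dict mapping a node to a list of (successor, weight) pairs.
    """
    dist = {n: math.inf for n in nodes}
    prev = {}
    dist[source] = 0.0
    # Relax outgoing edges once per node, in topological order.
    for u in nodes:
        if dist[u] == math.inf:
            continue  # node is unreachable from the source
        for v, w in edges.get(u, []):
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                prev[v] = u
    if dist[sink] == math.inf:
        return math.inf, []
    # Walk the predecessor links back from the sink.
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[sink], path


# Hypothetical candidate sites, named "type@position"; weights are made up.
nodes = ["source", "start@10", "donor@50", "donor@70",
         "acceptor@120", "stop@200", "sink"]
edges = {
    "source": [("start@10", 0.0)],
    "start@10": [("donor@50", 1.0),   # initial exon 10..50
                 ("donor@70", 2.5),   # initial exon 10..70
                 ("stop@200", 6.0)],  # single-exon gene
    "donor@50": [("acceptor@120", 1.0)],      # intron
    "donor@70": [("acceptor@120", 0.5)],      # intron
    "acceptor@120": [("stop@200", 1.0)],      # terminal exon
    "stop@200": [("sink", 0.0)],
}

weight, path = dag_shortest_path(nodes, edges, "source", "sink")
print(weight, path)
```

Because the nodes are processed in topological order, each edge is relaxed exactly once, so the search runs in time linear in the number of nodes and edges. In this toy graph the two-exon structure using the donor at position 50 wins with total weight 3.0.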
