JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions

Background: Predicting complete protein-coding genes in human DNA remains a significant challenge. Though a number of promising approaches have been investigated, an ideal suite of tools has yet to emerge that can provide near perfect levels of sensitivity and specificity at the level of whole genes. As an incremental step in this direction, it is hoped that controlled gene finding experiments in the ENCODE regions will provide a more accurate view of the relative benefits of different strategies for modeling and predicting gene structures.

Results: Here we describe our general-purpose eukaryotic gene finding pipeline and its major components, as well as the methodological adaptations that we found necessary in accommodating human DNA in our pipeline, noting that a similar level of effort may be necessary by ourselves and others with similar pipelines whenever a new class of genomes is presented to the community for analysis. We also describe a number of controlled experiments involving the differential inclusion of various types of evidence and feature states into our models and the resulting impact these variations have had on predictive accuracy.

Conclusion: While in the case of the non-comparative gene finders we found that adding model states to represent specific biological features did little to enhance predictive accuracy, for our evidence-based 'combiner' program the incorporation of additional evidence tracks tended to produce significant gains in accuracy for most evidence types, suggesting that improved modeling efforts at the hidden Markov model level are of relatively little value. We relate these findings to our current plans for future research.
