Type of Report: Other
Department of Computer Science & Engineering, Washington University in St. Louis, Campus Box 1045, St. Louis, MO 63130, ph: (314) 935-6160

Optimization of Gene Prediction via More Accurate Phylogenetic Substitution Models
Ezekiel Maier, Randall H Brown, and Michael R Brent
Department of Computer Science and Engineering, Washington University, Saint Louis, MO, 63130

Abstract: Determining the beginning and end positions of each exon in each protein-coding gene within a genome can be difficult because the DNA patterns that signal a gene’s presence have multiple weakly related alternate forms and the DNA fragments that comprise a gene are generally small in comparison to the size of the genome. In response to this challenge, automated gene predictors were created to generate putative gene structures. N-SCAN identifies gene structures in a target DNA sequence and can use conservation patterns learned from alignments between a target and one or more informant DNA sequences. N-SCAN uses a Bayesian network, generated from a phylogenetic tree, to probabilistically relate the target sequence to the aligned sequence(s). Phylogenetic substitution models are used to estimate substitution likelihood along the branches of the tree. Although N-SCAN’s predictive accuracy is already a benchmark for de novo HMM-based gene predictors, optimizing its use of substitution models will allow for improved conservation pattern estimates, leading to even better accuracy. Selecting optimal substitution models requires avoiding overfitting, because more detailed models require more free parameters; unfortunately, the number of parameters that can be estimated reliably is limited by the number of known genes available for parameter estimation (training). In order to optimize substitution model selection, we tested eight models on the entire genome, including the General, Reversible, HKY, Jukes-Cantor, and Kimura models.
In addition to testing models on the entire genome, we investigated genome-feature-based model selection strategies by assessing how accurately each model reflects the unique conservation patterns present in each genome region. Context dependency was examined using zeroth-, first-, and second-order models. All models were tested on the human and D. melanogaster genomes. Analysis of the data suggests that the nucleotide equilibrium frequency assumption (denoted π_i) is the strongest predictor of a model’s accuracy, followed by reversibility and transition/transversion inequality. Furthermore, second-order models are shown to give an average improvement of 0.6% over first-order models, which in turn give an 18% improvement over zeroth-order models. Finally, by limiting the number of parameters according to the number of training examples available for each feature, genome-feature-based model selection better estimates substitution likelihood, leading to a significant improvement in N-SCAN’s gene annotation accuracy.
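
To make the relationship between three of the named models concrete, the following is a minimal Python sketch, not N-SCAN's implementation. It builds an HKY rate matrix and computes substitution probabilities along a branch as P(t) = exp(Qt). Jukes-Cantor and Kimura (two-parameter) fall out as special cases: Kimura fixes the equilibrium frequencies to 1/4, and Jukes-Cantor additionally fixes the transition/transversion ratio to 1. The function names, the branch length of 0.5, the value kappa = 4, and the example frequencies are all assumptions chosen for illustration. The parameter counts follow the common convention of scaling each matrix to one expected substitution per site per unit of branch length, which may differ from the convention used in the report.

```python
BASES = "ACGT"
# Transitions are A<->G (purines) and C<->T (pyrimidines); all others are transversions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

# Free rate-matrix parameters under the scaling convention described above (assumption).
FREE_PARAMETERS = {"JC69": 0, "K80": 1, "HKY85": 4}


def hky_rate_matrix(kappa, pi):
    """Return the HKY rate matrix Q, scaled to one expected substitution per unit time."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                q[i][j] = pi[j] * (kappa if (x, y) in TRANSITIONS else 1.0)
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    rate = -sum(pi[i] * q[i][i] for i in range(4))
    return [[v / rate for v in row] for row in q]


def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transition_probs(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by a truncated Taylor series plus repeated squaring."""
    scale = t / (2 ** squarings)
    step = [[v * scale for v in row] for row in q]
    term = [[float(i == j) for j in range(4)] for i in range(4)]  # identity matrix
    result = [row[:] for row in term]
    for n in range(1, terms + 1):
        term = [[v / n for v in row] for row in matmul(term, step)]
        result = [[result[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


uniform = [0.25] * 4
examples = [
    ("JC69", 1.0, uniform),                # equal frequencies, equal rates
    ("K80", 4.0, uniform),                 # equal frequencies, transitions favored
    ("HKY85", 4.0, [0.3, 0.2, 0.2, 0.3]),  # unequal frequencies, transitions favored
]
for name, kappa, pi in examples:
    p = transition_probs(hky_rate_matrix(kappa, pi), 0.5)
    print("%-5s free params=%d  P(A->A)=%.4f  P(A->G)=%.4f  P(A->C)=%.4f"
          % (name, FREE_PARAMETERS[name], p[0][0], p[0][2], p[0][1]))
```

Each step up in model detail adds free parameters that must be estimated from training data, which is the overfitting trade-off the abstract describes. The Reversible and General models relax further constraints on Q and are omitted here for brevity.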