Simultaneous gene finding in multiple genomes

MOTIVATION As the tree of life is populated ever more densely with sequenced genomes, the new challenge is the accurate and consistent annotation of entire clades of genomes. We address this problem with a new approach to comparative gene finding that takes a multiple genome alignment of closely related species and simultaneously predicts the location and structure of protein-coding genes in all input genomes, thereby exploiting negative selection and sequence conservation. The model prefers potential gene structures in the different genomes that agree with each other or, where they differ, whose exon gains and losses are plausible given the species tree. We formulate the multi-species gene finding problem as a binary labeling problem on a graph. The resulting optimization problem is NP-hard, but can be efficiently approximated using a subgradient-based dual decomposition approach.

RESULTS The proposed method was tested on whole-genome alignments of 12 vertebrate and 12 Drosophila species. The accuracy was evaluated for human, mouse and Drosophila melanogaster and compared to competing methods. Results suggest that our method is well suited for annotating a large number of genomes of closely related species within a clade, in particular when RNA-Seq data are available for many of the genomes. The transfer of existing annotations from one genome to another via the genome alignment is more accurate than previous approaches that are based on protein-spliced alignments, when the genomes are at close to medium distances.

AVAILABILITY AND IMPLEMENTATION The method is implemented in C++ as part of Augustus and available open source at http://bioinf.uni-greifswald.de/augustus/

CONTACT stefaniekoenig@ymail.com or mario.stanke@uni-greifswald.de

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
