The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies

Background: The accurate annotation of genes in newly sequenced genomes remains a challenge. Although sophisticated comparative pipelines are available, computationally derived gene models are often less than perfect. This is particularly true when multiple similar paralogs are present. The issue is aggravated further when genomes are assembled only at a preliminary draft level to contigs or short scaffolds. However, these genomes deliver valuable information for studying gene families. High accuracy models of protein coding genes are needed in particular for phylogenetics and for the analysis of gene family histories.

Results: We present a pipeline, ExonMatchSolver, that is designed to help the user to produce and curate high quality models of the protein-coding part of genes. The tool in particular tackles the problem of identifying those coding exon groups that belong to the same paralogous genes in a fragmented genome assembly. This paralog-to-contig assignment problem is shown to be NP-complete. It is phrased and solved as an Integer Linear Programming problem.

Conclusions: The ExonMatchSolver-pipeline can be employed to build highly accurate models of protein coding genes even when spanning several genomic fragments. This sets the stage for a better understanding of the evolutionary history within particular gene families which possess a large number of paralogs and in which frequent gene duplication events occurred.
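To make the assignment idea concrete, the following is a minimal sketch of a toy version of the problem, not the formulation used by ExonMatchSolver. It assumes (hypothetically) that each contig carries a set of exon hits, that each contig may be assigned to at most one paralog, and that two contigs assigned to the same paralog must not carry the same exon. The scores, contig names, and the function `best_assignment` are illustrative inventions. An exhaustive search stands in for an Integer Linear Programming solver here, which is only feasible for very small instances.

```python
from itertools import product


def best_assignment(scores, exons):
    """Exhaustively search a toy paralog-to-contig assignment.

    scores[c][p]: assumed score for assigning contig c to paralog p.
    exons[c]:     set of exon indices with hits on contig c.

    Assumed constraints (illustrative only):
      * each contig is assigned to at most one paralog (None = unassigned);
      * contigs assigned to the same paralog carry disjoint exon sets.
    """
    contigs = sorted(scores)
    paralogs = sorted({p for c in contigs for p in scores[c]})
    best_total, best = float("-inf"), None

    # Enumerate every way of giving each contig a paralog or leaving it out.
    for choice in product([None] + paralogs, repeat=len(contigs)):
        used = {p: set() for p in paralogs}  # exons already covered per paralog
        total, feasible = 0.0, True
        for c, p in zip(contigs, choice):
            if p is None:
                continue
            # Reject if the contig has no score for p or an exon is duplicated.
            if p not in scores[c] or used[p] & exons[c]:
                feasible = False
                break
            used[p] |= exons[c]
            total += scores[c][p]
        if feasible and total > best_total:
            best_total, best = total, dict(zip(contigs, choice))
    return best_total, best


# Hypothetical toy instance: two paralogs (A, B), three contigs.
scores = {
    "contig1": {"A": 50, "B": 40},
    "contig2": {"A": 45, "B": 48},
    "contig3": {"A": 30, "B": 35},
}
exons = {"contig1": {1, 2}, "contig2": {1, 2}, "contig3": {3}}

total, assignment = best_assignment(scores, exons)
print(total, assignment)
```

In this toy instance, contig1 and contig2 both carry exons 1 and 2, so they cannot belong to the same paralog, while contig3 (exon 3) can join either. The search space grows exponentially with the number of contigs, which is consistent with the NP-completeness stated above and is the reason a dedicated ILP solver is preferable in practice.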
