Using multiple reference genomes to identify and resolve annotation inconsistencies

Background: Advances in sequencing technologies have led to the release of reference genomes and annotations for multiple individuals within well-studied systems. Although these genome assemblies share large syntenic regions, the annotated structure of gene models within those regions can differ. Of particular concern are split-gene misannotations, in which a single gene is incorrectly annotated as two distinct genes or two genes are incorrectly annotated as a single gene. These misannotations can have major impacts on functional prediction, estimates of expression, and many downstream analyses.

Results: We developed a high-throughput method, based on pairwise comparisons of annotations, that detects potential split-gene misannotations and quantifies support for whether the genes should be merged into a single gene model. We demonstrate the utility of our method using gene annotations of three maize reference genomes (B73, PH207, and W22); maize is a difficult system to annotate because of the size and complexity of its genome. On average, we find several hundred potential split-gene misannotations in each pairwise comparison, corresponding to 3-5% of gene models across annotations. To determine which state (i.e., one gene or multiple genes) is biologically supported, we use RNAseq data from 10 tissues throughout development along with a novel metric and simulation framework. The methods we have developed require minimal human interaction and can be applied to future assemblies to aid annotation efforts.

Conclusions: Split-gene misannotations occur at appreciable frequency in maize annotations. We have developed a method to easily identify and correct these misannotations. Importantly, this method is generic in that it can use any type of short-read expression data.
Failure to account for split-gene misannotations has serious consequences for biological inference, particularly for expression-based analyses.
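
To make the idea of a pairwise comparison concrete, the following is a minimal sketch of how split-gene candidates might be flagged. It is not the authors' implementation: the `Gene` record, the `find_split_candidates` function, and the input format (a list of gene-to-gene hit pairs between two annotations) are all hypothetical, and the criterion used here (one gene in annotation A hitting two or more neighbouring, non-overlapping genes in annotation B) is an assumed simplification. The expression-based metric and simulation framework mentioned in the abstract are not modelled.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical minimal gene record for one annotation."""
    gene_id: str
    chrom: str
    start: int
    end: int


def find_split_candidates(hits, genes_b):
    """Flag genes in annotation A that hit adjacent genes in annotation B.

    hits    : iterable of (gene_a_id, gene_b_id) pairs; assumed to be
              pre-filtered similarity hits between the two annotations
    genes_b : iterable of Gene records describing annotation B

    Returns {gene_a_id: [gene_b_id, ...]} where the listed B genes lie on
    one chromosome, are consecutive in gene order, and do not overlap.
    """
    # Rank the genes of annotation B by position within each chromosome.
    by_chrom = defaultdict(list)
    for gene in genes_b:
        by_chrom[gene.chrom].append(gene)
    rank = {}
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: (g.start, g.end))
        for index, gene in enumerate(chrom_genes):
            rank[gene.gene_id] = (gene, index)

    # Collect the distinct B genes hit by each A gene.
    targets = defaultdict(set)
    for a_id, b_id in hits:
        if b_id in rank:
            targets[a_id].add(b_id)

    candidates = {}
    for a_id, b_ids in targets.items():
        if len(b_ids) < 2:
            continue  # a one-to-one match is not a split-gene candidate
        ranked = sorted((rank[b] for b in b_ids),
                        key=lambda item: (item[0].chrom, item[1]))
        pairs = list(zip(ranked, ranked[1:]))
        same_chrom = len({gene.chrom for gene, _ in ranked}) == 1
        consecutive = all(nxt[1] - prev[1] == 1 for prev, nxt in pairs)
        disjoint = all(prev[0].end < nxt[0].start for prev, nxt in pairs)
        if same_chrom and consecutive and disjoint:
            candidates[a_id] = [gene.gene_id for gene, _ in ranked]
    return candidates
```

Running the function in both directions (A against B, then B against A) would give the two sides of one pairwise comparison. A candidate flagged this way only indicates disagreement between annotations; deciding whether the merged or the split model is correct requires additional evidence such as the expression data described above.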
