Choosing the Best Gene Predictions with GeneValidator

GeneValidator is a tool for determining whether the characteristics of newly predicted protein-coding genes are consistent with those of similar sequences in public databases. To do this, it runs up to seven comparisons per gene. Results are shown in an HTML report containing summary statistics and graphical visualizations that aim to be useful for curators. Results are also provided in CSV and JSON formats for automated follow-up analysis. Here, we describe common usage scenarios of GeneValidator that combine its JSON output with standard UNIX tools. We demonstrate how GeneValidator's textual output can be used to filter and subset large gene sets effectively. First, we explain how low-scoring gene models can be identified and extracted for manual curation, for example as input for genome browsers or gene annotation tools. Second, we show how GeneValidator's HTML report can be regenerated from a filtered subset of GeneValidator's JSON output. Third, we demonstrate how GeneValidator's GUI can be used to complement manual curation efforts. Fourth, we explain how GeneValidator can be used to merge information from multiple annotations by automatically selecting the higher-scoring gene model at each common gene locus. Finally, we show how GeneValidator analyses can be optimized when using large BLAST databases.
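
The filtering and merging steps described above can be illustrated with a minimal Python sketch. It is not the authors' workflow, which uses standard UNIX tools. It assumes that the JSON output is a list of records, each with an `overall_score` and a `definition` field; these field names, the threshold of 50, and the `locus_of` helper are assumptions made for illustration and should be checked against real GeneValidator output.

```python
import json

# Assumed field names; verify them against your own GeneValidator JSON output.
SCORE_KEY = "overall_score"
ID_KEY = "definition"


def low_scoring(records, threshold=50):
    """Return records scoring below `threshold` (missing scores count as 0)."""
    return [r for r in records if r.get(SCORE_KEY, 0) < threshold]


def best_per_locus(records, locus_of):
    """Keep the highest-scoring record for each locus.

    `locus_of` is a caller-supplied (hypothetical) function that maps a
    record to a locus identifier; how loci are matched depends on your data.
    """
    best = {}
    for r in records:
        locus = locus_of(r)
        if locus not in best or r[SCORE_KEY] > best[locus][SCORE_KEY]:
            best[locus] = r
    return list(best.values())


if __name__ == "__main__":
    # Toy data standing in for the contents of a GeneValidator JSON file.
    text = """[
        {"definition": "geneA_v1", "overall_score": 90},
        {"definition": "geneA_v2", "overall_score": 40},
        {"definition": "geneB_v1", "overall_score": 30}
    ]"""
    records = json.loads(text)

    # Candidates for manual curation.
    for r in low_scoring(records):
        print("curate:", r[ID_KEY])

    # Toy locus rule: the part of the identifier before the underscore.
    for r in best_per_locus(records, lambda r: r[ID_KEY].split("_")[0]):
        print("keep:", r[ID_KEY])
```

In this sketch the locus rule is passed in as a function because matching gene models across annotations usually depends on genomic coordinates, which this toy data does not include.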
