Approaches to Fungal Genome Annotation

Fungal genome annotation is the starting point for analysis of genome content. It generally involves applying diverse methods to identify features on a genome assembly, such as protein-coding and non-coding genes, repeats and transposable elements, and pseudogenes. Here, we describe tools and methods used for eukaryotic genome annotation, with a focus on the annotation of fungal nuclear and mitochondrial genomes. We highlight the application of the latest technologies and tools to improve the quality of predicted gene sets. The Broad Institute eukaryotic genome annotation pipeline is described as one example of how such methods and tools are integrated into a sequencing center's production genome annotation environment.
