Genome Annotation in Plants and Fungi: EuGene as a Model Platform

In this era of whole-genome sequencing, reliable genome annotations (the identification of functional regions) are the cornerstone of many subsequent analyses. Careful annotation is important not only for studying the gene and gene family content of a genome and its host, but also for wide-scale transcriptome and proteome analyses that attempt to describe a particular biological process or to obtain a global picture of a cell's behavior. Although the number of sequenced genomes is increasing thanks to the application of new technologies, genome-wide analyses will critically depend on the quality of the genome annotations. The annotation process is, however, more complicated in the plant field than in the animal field, because limited funding leads to far fewer experimental data and less annotation expertise. This situation calls for highly automated annotation platforms that make the best use of all available data, experimental or not. We discuss how the field of gene prediction (the process of predicting protein-coding gene structures in genomic sequences) is increasingly shifting from methods that typically exploited one or two types of data to more integrative approaches that simultaneously handle various kinds of experimental, statistical, or other in silico evidence. Using the EuGene platform as an example, we illustrate the importance of integrative approaches for producing high-quality automatic annotations of the genomes of plants and algae, as well as of fungi that live in close association with plants.
