Compositional Features of Eukaryotic Genomes for Checking Predicted Genes

Gene prediction relies on the identification of characteristic features of coding sequences that distinguish them from non-coding DNA. The recent large-scale sequencing of entire genomes from higher eukaryotes, in conjunction with currently used gene prediction algorithms, has provided an abundance of putative genes that can now be analysed for their compositional properties. Strong, systematic differences still exist, in several species, between the compositional properties of sets of ex novo predicted genes and genes that have been experimentally detected and/or verified. This is particularly evident in the estimated gene set (>45,000 genes) of the recently sequenced rice genome, where roughly half the predicted genes are compositionally unusual and have no known orthologues in the dicot Arabidopsis. In a few cases such differences might suggest a bias in experimental gene-finding protocols, but the quasi-random nature of the compositionally aberrant predicted genes is a strong indication that many, if not most, of them are false positives. It therefore appears that some important features of coding regions have not yet been taken into account in existing gene prediction programs. Statistical base compositional properties of curated gene data sets from vertebrates, which we briefly review here, should therefore provide a useful benchmark for fine-tuning probabilistic gene models and model parameters that are currently in use.
