Performance and Scalability of Discriminative Metrics for Comparative Gene Identification in 12 Drosophila Genomes

Comparative genomics of multiple related species is a powerful methodology for the discovery of functional genomic elements, and its power should increase with the number of species compared. Here, we use 12 Drosophila genomes to study the power of comparative genomics metrics to distinguish between protein-coding and non-coding regions. First, we study the relative power of different comparative metrics and their relationship to single-species metrics. We find that even relatively simple multi-species metrics robustly outperform advanced single-species metrics, especially for shorter exons (≤240 nt), which are common in animal genomes. Moreover, the two classes of metrics capture largely independent features of protein-coding genes, with different sensitivity/specificity trade-offs, such that their combinations lead to even greater discriminatory power. In addition, we study how discovery power scales with the number and phylogenetic distance of the genomes compared. We find that species at a broad range of distances are comparably effective informants for pairwise comparative gene identification, but that these are surpassed by multi-species comparisons at similar evolutionary divergence. In particular, while pairwise discovery power plateaued at larger distances and never outperformed the most advanced single-species metrics, multi-species comparisons continued to benefit even from the most distant species with no apparent saturation. Last, we find that genes in functional categories typically considered fast-evolving can nonetheless be recovered at very high rates using comparative methods. Our results have implications for comparative genomics analyses in any species, including human.
