Phylogenetic Reconstruction of Orthology, Paralogy, and Conserved Synteny for Dog and Human

Accurate predictions of orthology and paralogy relationships are necessary to infer human molecular function from experiments in model organisms. Previous genome-scale approaches to predicting these relationships have been limited by their reliance on protein similarity and by their failure to account for multiple splicing events and gene prediction errors. We have developed PhyOP, a new phylogenetic orthology prediction pipeline based on estimates of synonymous substitution rates, which accurately predicts orthology and paralogy relationships for transcripts, genes, exons, or genomic segments between closely related genomes. Using PhyOP, we identified orthologue relationships to human genes for 93% of all dog genes from Ensembl. Among 1:1 orthologues, the alignments covered a median of 97.4% of protein sequences, and 92% of orthologues shared essentially identical gene structures. PhyOP accurately recapitulated genomic maps of conserved synteny. Benchmarking against predictions from Ensembl and Inparanoid showed that PhyOP is more accurate, especially in its predictions of paralogy. Nearly half (46%) of PhyOP paralogy predictions are unique. Using PhyOP to investigate orthologues and paralogues in the human and dog genomes, we found that the human assembly contains 3-fold more gene duplications than the dog assembly. Species-specific duplicate genes, or "in-paralogues," are generally shorter and have fewer exons than 1:1 orthologues, which is consistent with selective constraints and mutation biases that depend on the sizes of duplicated genes. In-paralogues have experienced elevated amino acid and synonymous nucleotide substitution rates. Duplicated genes in the dog and human lineages possess similar biological functions. Having accounted for 2,954 likely pseudogenes and gene fragments, and after separating 346 erroneously merged genes, we estimated that the human genome encodes a minimum of 19,700 protein-coding genes, similar to the gene count of nematode worms.
PhyOP is a fast and robust approach to orthology prediction that will be applicable to whole genomes from multiple closely related species. PhyOP will be particularly useful in predicting orthology for mammalian genomes that have been incompletely sequenced, and for large families of rapidly duplicating genes.
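The idea of separating orthologues from in-paralogues using synonymous distances can be illustrated with a small sketch. This is not the PhyOP pipeline, which builds phylogenies; it swaps in a much simpler reciprocal-minimum heuristic over pairwise synonymous distances (dS). The function names, gene identifiers, and dS values below are all hypothetical and chosen only for illustration.

```python
def build_lookup(pairs):
    """Turn (gene_a, gene_b, dS) triples into a symmetric nested dict."""
    dist = {}
    for a, b, d in pairs:
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist


def nearest_cross_species(gene, dist, species):
    """Return (dS, partner) for the closest gene in the other species, or None."""
    cross = [
        (d, other)
        for other, d in dist.get(gene, {}).items()
        if species[other] != species[gene]
    ]
    return min(cross) if cross else None


def classify(pairs, species):
    """Simplified heuristic (an assumption, not PhyOP's tree-based method).

    Returns two sets of frozensets:
      - reciprocal_best: cross-species pairs that are each other's nearest hit
      - in_paralogues: same-species pairs closer to each other (lower dS)
        than either member is to any gene in the other species
    """
    dist = build_lookup(pairs)
    best = {g: nearest_cross_species(g, dist, species) for g in species}

    reciprocal_best = set()
    for gene, hit in best.items():
        if hit is None:
            continue
        _, partner = hit
        if best[partner] is not None and best[partner][1] == gene:
            reciprocal_best.add(frozenset((gene, partner)))

    in_paralogues = set()
    for a, b, d in pairs:
        if species[a] != species[b]:
            continue
        cross_a = best[a][0] if best[a] else float("inf")
        cross_b = best[b][0] if best[b] else float("inf")
        # A duplication after speciation leaves the two copies less diverged
        # from each other than from any gene in the other species.
        if d < min(cross_a, cross_b):
            in_paralogues.add(frozenset((a, b)))

    return reciprocal_best, in_paralogues


# Hypothetical toy data: one human gene family (hB1, hB2) duplicated after
# the human-dog split, plus one unduplicated gene (hA / dA).
species = {"hA": "human", "hB1": "human", "hB2": "human", "dA": "dog", "dB": "dog"}
pairs = [
    ("hA", "dA", 0.35),
    ("hA", "dB", 2.00),
    ("hB1", "dB", 0.40),
    ("hB2", "dB", 0.42),
    ("hB1", "hB2", 0.05),
]

reciprocal_best, in_paralogues = classify(pairs, species)
print(sorted(sorted(p) for p in reciprocal_best))
print(sorted(sorted(p) for p in in_paralogues))
```

On this toy input, hA and dA form a reciprocal pair with no in-paralogues, which corresponds to a 1:1 orthologue relationship, while hB1 and hB2 are flagged as human in-paralogues that are co-orthologous to dB. A tree-based method would additionally resolve cases that a pairwise heuristic like this one cannot, such as unequal rates among duplicates.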
