Bioinformatic approaches to identifying orthologs and assessing evolutionary relationships.

Non-human primate genetic research defines itself through comparison to humans; for few other species is a comparative genomics approach so implicit. Because of this, errors in the identification of non-human primate orthologs can have profound effects. Gene prediction algorithms can produce, and have produced, false transcripts that have become incorporated into commonly used databases and genomics portals. These false transcripts can arise from deficiencies in the algorithms themselves as well as from gaps and other problems in the genome assembly. The resulting putative genes can not only miss microexons but also improperly incorporate non-coding sequence, yielding pseudogenes or other transcripts without biological relevance. False transcripts are then identified as orthologs of established human genes and are too often taken as gospel by unwary researchers. Here, the processes through which these errors propagate are isolated, and methods for identifying false orthologs in databases are described, with several representative errors illustrated. With these steps, any researcher seeking to use non-human primate genetic information will have the tools to ascertain where errors exist and to remedy them once encountered.
