OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches

Abstract

Motivation: Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin that branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, although it is neither. To overcome this problem, phylogeny-driven tools have emerged, but they rely on gene trees, whose inference is computationally expensive.

Results: Here, we first show that in multiple animal and plant datasets, 18–62% of assignments by closest sequence are incorrect, typically pointing to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. We show that OMAmer, whilst able to reject non-homologous family-level assignments, provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.

Availability and implementation: OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer.

Supplementary information: Supplementary data are available at Bioinformatics online.
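To make the idea of mapping k-mers to ancestral subfamilies more concrete, the following is a minimal toy sketch in Python. It is not OMAmer's actual algorithm or API: the hierarchy, the reference sequences, the function names, the k-mer length and the `min_hits` threshold are all hypothetical choices made for illustration. Each k-mer is stored at the most ancestral subfamily that covers all of its occurrences, and a query descends from the family root into a subfamily only when enough subfamily-specific k-mers support that step.

```python
from collections import Counter

# Hypothetical toy hierarchy: subfamily -> parent (None marks the family root).
PARENT = {"globin": None, "alpha": "globin", "beta": "globin"}
CHILDREN = {}
for child, parent in PARENT.items():
    CHILDREN.setdefault(parent, []).append(child)

# Hypothetical reference sequences, labeled with their most specific subfamily.
REFERENCE = {
    "alpha": ["MKVLAAGHE", "MKVLAAGHD"],
    "beta": ["MKVLTTPWE", "MKVLTTPWD"],
}

def kmers(seq, k=3):
    """Return the set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def path_to_root(node):
    """List a node and its ancestors, node first and root last."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def lca(a, b):
    """Lowest common ancestor of two subfamilies of the same family."""
    ancestors = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors:
            return node
    raise ValueError("nodes belong to different families")

def build_index(reference, k=3):
    """Map each k-mer to the deepest subfamily covering all its occurrences."""
    index = {}
    for subfamily, seqs in reference.items():
        for seq in seqs:
            for kmer in kmers(seq, k):
                index[kmer] = lca(index[kmer], subfamily) if kmer in index else subfamily
    return index

def subtree_hits(node, hits):
    """Count query k-mers stored at a node or anywhere below it."""
    return hits[node] + sum(subtree_hits(c, hits) for c in CHILDREN.get(node, []))

def assign(query, index, k=3, min_hits=2):
    """Place a query at the deepest subfamily supported by at least min_hits k-mers."""
    hits = Counter(index[km] for km in kmers(query, k) if km in index)
    node = CHILDREN[None][0]  # the single family root of this toy example
    if subtree_hits(node, hits) < min_hits:
        return None  # too little evidence even at the family level
    while CHILDREN.get(node):
        best = max(CHILDREN[node], key=lambda c: subtree_hits(c, hits))
        if subtree_hits(best, hits) < min_hits:
            break  # stay at the ancestral level rather than over-specify
        node = best
    return node

INDEX = build_index(REFERENCE)
```

With these toy data, `assign("MKVLAAGHE", INDEX)` returns `"alpha"`, whereas `assign("MKVLAQQQQ", INDEX)` stays at `"globin"`: the second query shares one more k-mer with the alpha references than with the beta references, so a closest-sequence rule would pick alpha, but a single subfamily-specific k-mer falls below the threshold in this sketch. A query with no matching k-mers, such as `"WWWWWWW"`, is rejected with `None`.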
