OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches

Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin that branched off before the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, even though it is neither. To overcome this problem, phylogeny-driven tools have emerged, but they rely on gene trees, whose inference is computationally expensive. Here, we first show that in multiple animal datasets, 19 to 68% of assignments by closest sequence are incorrect, typically because the query is placed in an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer uses subfamily-informed k-mers for alignment-free mapping to ancestral protein subfamilies. We show that OMAmer, while able to reject non-homologous family-level assignments, provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether that sequence is identified exactly by Smith-Waterman or by the fast heuristic DIAMOND. OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer.
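The contrast between closest-sequence assignment and k-mer mapping to ancestral subfamilies can be illustrated with a toy sketch. This is not OMAmer's implementation or its scoring scheme: the hierarchy, the sequences, the function names and the `min_fraction` threshold are all hypothetical, chosen only to show how a query that predates a duplication can be held at the ancestral level instead of being pushed into one of the descendant subfamilies.

```python
from collections import Counter

# Hypothetical toy hierarchy: child -> parent (None marks the family root).
PARENT = {"globin": None, "alpha": "globin", "beta": "globin"}

# Hypothetical reference sequences, one per subfamily (not real hemoglobins).
REFERENCES = {"alpha": ["MVLSPADKTNWGKV"], "beta": ["MVHLTPEEKSWGKV"]}


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def ancestors(node):
    """Return the path from node up to the root, node included."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lca(a, b):
    """Lowest common ancestor of two nodes in the toy hierarchy."""
    anc_a = ancestors(a)
    return next(n for n in ancestors(b) if n in anc_a)


def build_index(references, k):
    """Map each k-mer to the most specific node covering all its occurrences."""
    index = {}
    for subfamily, seqs in references.items():
        for seq in seqs:
            for km in kmers(seq, k):
                index[km] = lca(index[km], subfamily) if km in index else subfamily
    return index


def assign(query, index, k, min_fraction=0.3):
    """Descend from the root only while a child has enough k-mer support.

    min_fraction is an assumed, illustrative threshold.
    Returns None when the query has too little support even at the root.
    """
    query_kmers = kmers(query, k)
    hits = Counter(index[km] for km in query_kmers if km in index)
    if not hits:
        return None
    # A k-mer specific to a node also supports every ancestor of that node.
    support = Counter()
    for node, count in hits.items():
        for anc in ancestors(node):
            support[anc] += count
    node = next(n for n, parent in PARENT.items() if parent is None)
    if support[node] / len(query_kmers) < min_fraction:
        return None
    while True:
        children = [c for c, parent in PARENT.items() if parent == node]
        if not children:
            return node
        best = max(children, key=lambda c: support[c])
        if support[best] / len(query_kmers) < min_fraction:
            return node  # stay at the ancestral level
        node = best


def closest_subfamily(query, references, k):
    """Baseline: subfamily of the reference sharing the most k-mers."""
    query_kmers = kmers(query, k)
    scored = [(len(query_kmers & kmers(seq, k)), subfamily)
              for subfamily, seqs in references.items() for seq in seqs]
    return max(scored)[1]


if __name__ == "__main__":
    index = build_index(REFERENCES, k=3)
    query = "MVLSAAEEKAWGKV"  # hypothetical divergent globin-like query
    print("closest sequence:", closest_subfamily(query, REFERENCES, k=3))
    print("tree-aware k-mers:", assign(query, index, k=3))
```

In this toy setup the divergent query shares slightly more k-mers with the alpha reference than with the beta reference, so the closest-sequence baseline labels it alpha, whereas the tree-aware walk stops at the family root because neither child has enough support on its own.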
