Efficient inference of homologs in large eukaryotic pan-proteomes

BackgroundIdentification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, there is a need for efficient standalone tools to detect homologs in novel data.ResultsTo address this, we present a fast method for detecting homology groups across a large number of individuals and/or species. We adopted a k-mer based approach which considerably reduces the number of pairwise protein alignments without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method for detecting homology in large proteomes of bacteria, fungi, plants and Metazoa.ConclusionsWe clearly observed the trade-off between recall and precision in our homology inference. Favoring recall or precision strongly depends on the application. The clustering behavior of our program can be optimized for particular applications by altering a few key parameters. The program is available for public use at https://github.com/sheikhizadeh/pantools as an extension to our pan-genomic analysis tool, PanTools.

[1]  Salvador Capella-Gutiérrez,et al.  PhylomeDB v4: zooming into the plurality of evolutionary histories of a genome , 2013, Nucleic Acids Res..

[2]  Gaston H. Gonnet,et al.  Algorithm of OMA for large-scale orthology inference , 2008, BMC Bioinformatics.

[3]  M. Ruggero,et al.  Similarity of Traveling-Wave Delays in the Hearing Organs of Humans and Other Tetrapods , 2007, Journal for the Association for Research in Otolaryngology.

[4]  Tao Liu,et al.  TreeFam: 2008 Update , 2007, Nucleic Acids Res..

[5]  Damian Szklarczyk,et al.  eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations , 2009, Nucleic Acids Res..

[6]  Adrian M. Altenhoff,et al.  Standardized benchmarking in the quest for orthologs , 2016, Nature Methods.

[7]  C. Stoeckert,et al.  OrthoMCL: identification of ortholog groups for eukaryotic genomes. , 2003, Genome research.

[8]  Roger J Hajjar,et al.  Damped elastic recoil of the titin spring in myofibrils of human myocardium , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[9]  Xun Xu,et al.  The Tarenaya hassleriana Genome Provides Insight into Reproductive Trait and Genome Evolution of Crucifers[W][OPEN] , 2013, Plant Cell.

[10]  E. Koonin Orthologs, Paralogs, and Evolutionary Genomics 1 , 2005 .

[11]  Dick de Ridder,et al.  PanTools: representation, storage and exploration of pan-genomic data , 2016, Bioinform..

[12]  Anton J. Enright,et al.  An efficient algorithm for large-scale detection of protein families. , 2002, Nucleic acids research.

[13]  Caixia Wang,et al.  Predicting overlapping protein complexes based on core-attachment and a local modularity structure , 2018, BMC Bioinformatics.

[14]  Damian Szklarczyk,et al.  eggNOG v4.0: nested orthology inference across 3686 organisms , 2013, Nucleic Acids Res..

[15]  F. Tekaia Inferring Orthologs: Open Questions and Perspectives , 2016, Genomics insights.

[16]  Brian D. Ondov,et al.  Mash: fast genome and metagenome distance estimation using MinHash , 2015, Genome Biology.

[17]  Evgeny M. Zdobnov,et al.  OrthoDB: the hierarchical catalog of eukaryotic orthologs , 2007, Nucleic Acids Res..

[18]  E. Koonin Orthologs, paralogs, and evolutionary genomics. , 2005, Annual review of genetics.

[19]  Richard A Neher,et al.  panX: pan-genome analysis and exploration , 2016, bioRxiv.

[20]  The Computational Pan-Genomics Consortium,et al.  Computational pan-genomics: status, promises and challenges , 2018, Briefings Bioinform..

[21]  Christian E. V. Storm,et al.  Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. , 2001, Journal of molecular biology.

[22]  Vipin T. Sreedharan,et al.  Multiple reference genomes and transcriptomes for Arabidopsis thaliana , 2011, Nature.

[23]  Daniel A. Skelly,et al.  The 100-genomes strains, an S. cerevisiae resource that illuminates its natural phenotypic and genotypic variation and emergence as an opportunistic pathogen , 2015, Genome research.

[24]  J. Hirst,et al.  Structure of mammalian respiratory complex I , 2016, Nature.

[25]  Tae-Ho Lee,et al.  PGDD: a database of gene and genome duplication in plants , 2012, Nucleic Acids Res..

[26]  Gaston H. Gonnet,et al.  OMA 2011: orthology inference among 1000 complete genomes , 2010, Nucleic Acids Res..

[27]  P. Bork,et al.  Orthology prediction methods: A quality assessment using curated protein families , 2011, BioEssays : news and reviews in molecular, cellular and developmental biology.

[28]  Evgeny M. Zdobnov,et al.  OrthoDB v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs , 2016, Nucleic Acids Res..

[29]  S. Kelly,et al.  OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy , 2015, Genome Biology.

[30]  D. Lipman,et al.  A genomic perspective on protein families. , 1997, Science.

[31]  Feng Chen,et al.  OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups , 2005, Nucleic Acids Res..

[32]  Evgeny M. Zdobnov,et al.  OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011 , 2010, Nucleic Acids Res..