POTION: an end-to-end pipeline for positive Darwinian selection detection in genome-scale data through phylogenetic comparison of protein-coding genes

Background
Detecting genes that evolve under positive Darwinian selection in genome-scale data is now a prevailing strategy in comparative genomics for identifying genes potentially involved in adaptation. Despite the large number of studies that aim to detect and contextualize such gene sets, there is virtually no software available to perform this task in a general, automatic, large-scale and reliable manner. This gap stems from the computational challenges involved: modeling the data appropriately, the computation time required by several of the steps when dealing with genome-scale data, and the highly error-prone nature of the sequence and alignment data needed for genome-wide detection of positive selection.

Results
We present POTION, an open-source, modular, end-to-end software for genome-scale detection of positive Darwinian selection in groups of homologous coding sequences. Our software represents a key step towards genome-scale, automated detection of positive selection, taking predicted coding sequences and their homology relationships as input and producing high-quality groups of positively selected genes. POTION reduces false positives in two ways: through sequence and group filters based on numeric, phylogenetic, quality and conservation criteria that remove spurious data, and through multiple hypothesis corrections. Its parallelized design considerably reduces computation time. The software achieved high classification performance when used to evaluate a curated dataset of Trypanosoma brucei paralogs previously surveyed for positive selection. In a case study analyzing predicted groups of homologous genes from 19 strains of Mycobacterium tuberculosis, we demonstrated that the filters implemented in POTION remove sources of error that commonly inflate error rates in positive selection detection. A thorough literature review found no other software similar to POTION in terms of customization, scale and automation.

Conclusion
To the best of our knowledge, POTION is the first tool that allows users to construct and check hypotheses regarding the occurrence of site-based evidence of positive selection in non-curated, genome-scale data within a feasible time frame and with no human intervention after initial configuration. POTION is available at http://www.lmb.cnptia.embrapa.br/share/POTION/.
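As an illustration of the multiple hypothesis correction step mentioned in the Results, the following is a minimal sketch, not POTION's actual code. The abstract does not state which correction procedure is applied; the Benjamini-Hochberg procedure is assumed here purely for illustration, and the function names, the group identifiers and the 0.05 threshold are all hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Assumption: one p-value per homologous group, e.g. from a
    likelihood ratio test for positive selection.
    """
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_groups(group_pvalues, alpha=0.05):
    """Return the group names whose adjusted p-value is <= alpha.

    group_pvalues: dict mapping a (hypothetical) group name to its p-value.
    """
    names = list(group_pvalues)
    adjusted = benjamini_hochberg([group_pvalues[n] for n in names])
    return [n for n, q in zip(names, adjusted) if q <= alpha]
```

Calling `significant_groups({"group_1": 0.001, "group_2": 0.2, "group_3": 0.04})` keeps only `group_1`: the raw p-value of `group_3` is below 0.05, but its adjusted value is not, which is the kind of false positive such a correction is meant to remove.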
