Split-inducing indels in phylogenomic analysis

Background: Most phylogenetic studies using molecular data treat gaps in multiple sequence alignments as missing data, or exclude alignment columns that contain gaps altogether.

Results: Here we show that gap patterns in large-scale, genome-wide alignments are themselves phylogenetically informative and can be used to infer reliable phylogenies, provided the gap data are properly filtered to reduce noise introduced by the alignment method. We introduce the notion of split-inducing indels (splids), which define an approximate bipartition of the taxon set. We show, both in simulated data and in case studies on real-life data, that splids can be efficiently extracted from phylogenomic data sets.

Conclusions: Suitably processed gap patterns extracted from genome-wide alignments provide a surprisingly clear phylogenetic signal and allow the inference of accurate phylogenetic trees.
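To make the idea of a split-inducing indel concrete, the following is a minimal sketch in Python, not the authors' method. It assumes that a candidate splid is a gap with identical start and end coordinates in several aligned sequences, and that it is kept only if both sides of the resulting bipartition contain at least `min_side` taxa. The function names, the `min_side` threshold, and the toy alignment are all hypothetical; the abstract does not specify the exact definition or the noise filtering applied to real alignments, and the "approximate" aspect of the bipartition is not modelled here.

```python
from collections import defaultdict


def gap_intervals(row):
    """Yield half-open (start, end) intervals of maximal '-' runs in one aligned row."""
    start = None
    for i, ch in enumerate(row):
        if ch == "-":
            if start is None:
                start = i
        elif start is not None:
            yield (start, i)
            start = None
    if start is not None:
        yield (start, len(row))


def candidate_splits(alignment, min_side=2):
    """Map each shared gap interval to the bipartition of taxa it induces.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    min_side:  hypothetical threshold; both sides of the split need at
               least this many taxa, so gaps private to one taxon are dropped.
    """
    taxa = set(alignment)
    carriers = defaultdict(set)
    for taxon, row in alignment.items():
        for interval in gap_intervals(row):
            carriers[interval].add(taxon)

    splits = {}
    for interval, with_gap in carriers.items():
        without_gap = taxa - with_gap
        if len(with_gap) >= min_side and len(without_gap) >= min_side:
            splits[interval] = (frozenset(with_gap), frozenset(without_gap))
    return splits


# Toy alignment (invented for illustration only).
toy = {
    "A": "ACG--TACGT",
    "B": "ACG--TACGT",
    "C": "ACGTTTACGT",
    "D": "ACGTTTA-GT",
    "E": "ACGTTTACGT",
}

for interval, (inside, outside) in candidate_splits(toy).items():
    print(interval, sorted(inside), sorted(outside))
# prints: (3, 5) ['A', 'B'] ['C', 'D', 'E']
```

In this toy case the gap shared by A and B yields the split {A, B} | {C, D, E}, while the single-taxon gap in D is discarded because it separates only one taxon from the rest and so carries no grouping information.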
