In search of lost introns

Many fundamental questions concerning the emergence and subsequent evolution of eukaryotic exon-intron organization are still unsettled. Genome-scale comparative studies, which can shed light on crucial aspects of eukaryotic evolution, require adequate computational tools. We describe novel computational methods for studying spliceosomal intron evolution. Our goal is to give a reliable characterization of the dynamics of intron evolution. Our algorithmic innovations address the identification of orthologous introns and the likelihood-based analysis of intron data. We discuss a compression method for the evaluation of the likelihood function, which is noteworthy for phylogenetic likelihood problems in general. We prove that after O(n l) preprocessing time, subsequent evaluations take O(n l/log l) time almost surely in the Yule-Harding random model of n-taxon phylogenies, where l is the input sequence length. We illustrate the practicality of our methods by compiling and analyzing a data set involving 18 eukaryotes, which is more than in any other study to date. The study yields the surprising result that ancestral eukaryotes were fairly intron-rich. For example, the bilaterian ancestor is estimated to have had more than 90% as many introns as vertebrates do now.

Availability: The Java implementations of the algorithms are publicly available from the corresponding author's site http://www.iro.umontreal.ca/~csuros/introns/. Data are available on request.
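The idea of compressing a likelihood evaluation can be illustrated with a minimal sketch. This is not the authors' Java implementation, and it makes no attempt to reach the O(n l/log l) bound stated in the abstract. It assumes a two-state presence/absence model with hypothetical `gain` and `loss` rates, a stationary distribution at the root, and a tree encoded as nested tuples of `(child, branch_length)` pairs; all of these names and choices are illustrative assumptions. The sketch runs Felsenstein's pruning recursion, evaluates each distinct alignment column only once, and reuses the conditional likelihoods of a subtree whenever the same pattern of leaf states recurs below it.

```python
import math
from collections import Counter

# Assumed two-state model: 0 = intron absent, 1 = intron present.
def transition(gain, loss, t):
    """Transition matrix P[a][b] of a two-state Markov chain over time t."""
    total = gain + loss
    e = math.exp(-total * t)
    pi0, pi1 = loss / total, gain / total   # stationary probabilities
    return [[pi0 + pi1 * e, pi1 - pi1 * e],
            [pi0 - pi0 * e, pi1 + pi0 * e]]

# Hypothetical tree encoding: a leaf is a name (str); an internal node is
# a tuple of (child, branch_length) pairs.
def leaves(tree):
    if isinstance(tree, str):
        return [tree]
    return [name for child, _ in tree for name in leaves(child)]

def conditional(tree, column, gain, loss, cache):
    """Return [L0, L1]: probability of the leaf states below `tree`,
    given that the subtree root is in state 0 or 1."""
    if isinstance(tree, str):
        return [1.0, 0.0] if column[tree] == 0 else [0.0, 1.0]
    # Compression: the result depends only on the leaf states below this
    # node, so identical subtree patterns are computed once and reused.
    key = (id(tree), tuple(column[name] for name in leaves(tree)))
    if key in cache:
        return cache[key]
    result = [1.0, 1.0]
    for child, t in tree:
        p = transition(gain, loss, t)
        below = conditional(child, column, gain, loss, cache)
        for a in (0, 1):
            result[a] *= p[a][0] * below[0] + p[a][1] * below[1]
    cache[key] = result
    return result

def log_likelihood(tree, columns, gain, loss):
    """Log-likelihood of a list of columns (dicts: leaf name -> 0 or 1)."""
    total = gain + loss
    root_prior = [loss / total, gain / total]   # assumed stationary root
    cache = {}
    # Whole-column compression: count each distinct column once.
    counts = Counter(tuple(sorted(col.items())) for col in columns)
    ll = 0.0
    for pattern, count in counts.items():
        l0, l1 = conditional(tree, dict(pattern), gain, loss, cache)
        ll += count * math.log(root_prior[0] * l0 + root_prior[1] * l1)
    return ll

# Usage on a toy three-taxon tree ((A:0.1, B:0.2):0.3, C:0.4).
toy_tree = (((("A", 0.1), ("B", 0.2)), 0.3), ("C", 0.4))
toy_columns = [{"A": 1, "B": 1, "C": 0}, {"A": 1, "B": 1, "C": 1},
               {"A": 1, "B": 1, "C": 0}, {"A": 0, "B": 1, "C": 1}]
toy_ll = log_likelihood(toy_tree, toy_columns, gain=0.5, loss=1.0)
```

In this toy input the pattern (A=1, B=1) occurs in three columns, so the conditional likelihoods of the (A, B) subtree are computed once and looked up afterwards. Building the cache key here still walks the subtree, so the sketch shows where repeated patterns save arithmetic rather than how to index them efficiently. It also ignores the correction for unobserved all-absent columns that intron presence/absence data would normally require.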
