Learning Retention Mechanisms and Evolutionary Parameters of Duplicate Genes from Their Expression Data

Learning about the roles that duplicate genes play in the origins of novel phenotypes requires an understanding of how their functions evolve. To date, only one method, CDROM, has been developed with this goal in mind. In particular, CDROM employs gene expression distances as proxies for functional divergence, and then classifies the evolutionary mechanisms retaining duplicate genes from comparisons of these distances in a decision tree framework. However, CDROM does not account for stochastic shifts in gene expression or leverage advances in contemporary statistical learning for performing classification, nor is it capable of predicting the underlying parameters of duplicate gene evolution. Thus, here we develop CLOUD, a multi-layer neural network built upon a model of gene expression evolution that can both classify duplicate gene retention mechanisms and predict their underlying evolutionary parameters. We show that the CLOUD classifier is not only substantially more powerful and accurate than CDROM, but also yields accurate parameter predictions, enabling a better understanding of the specific forces driving the evolution and long-term retention of duplicate genes. Further, application of the CLOUD classifier and predictor to empirical data from Drosophila recapitulates many previous findings about gene duplication in this lineage, showing that new functions often emerge rapidly and asymmetrically in younger duplicate gene copies, and that functional divergence is driven by strong natural selection. Hence, CLOUD represents the best available method for classifying retention mechanisms and predicting evolutionary parameters of duplicate genes, and it highlights the utility of incorporating sophisticated statistical learning techniques to address long-standing questions about evolution after gene duplication.
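
The distance-based decision tree that the abstract attributes to CDROM can be illustrated with a minimal sketch. The abstract does not spell out the rules, so everything below is an assumption for illustration: the function names, the use of Euclidean distance between expression profiles, the single `cutoff` threshold, and the four class labels are hypothetical and are not taken from the CDROM or CLOUD software.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def classify_retention(parent, child, ancestral, cutoff):
    """Toy distance-based decision tree (assumed rules, not the published ones).

    parent, child, ancestral: expression values across the same conditions.
    cutoff: hypothetical threshold above which a profile counts as diverged.
    """
    # Combined profile of the two copies, compared against the ancestral state.
    combined = [p + c for p, c in zip(parent, child)]
    d_parent = euclidean(parent, ancestral)
    d_child = euclidean(child, ancestral)
    d_combined = euclidean(combined, ancestral)

    parent_diverged = d_parent > cutoff
    child_diverged = d_child > cutoff

    if not parent_diverged and not child_diverged:
        return "conservation"
    if parent_diverged and not child_diverged:
        return "neofunctionalization (parent)"
    if child_diverged and not parent_diverged:
        return "neofunctionalization (child)"
    # Both copies diverged: check whether together they recover the ancestor.
    if d_combined <= cutoff:
        return "subfunctionalization"
    return "specialization"
```

For example, `classify_retention([1, 0], [0, 1], [1, 1], cutoff=0.5)` follows the branch in which both copies have diverged but their combined profile matches the ancestral one. A hard threshold like this treats any distance above `cutoff` as divergence, which is the kind of rule that cannot account for stochastic shifts in expression, the limitation the abstract raises.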
