A Continuum of Evolving De Novo Genes Drives Protein-Coding Novelty in Drosophila

Orphan genes, which lack detectable homologs in outgroup species, typically make up 10–30% of the genes in eukaryotic genomes. Efforts to find the source of these young genes indicate that de novo emergence from non-coding DNA may partly explain their prevalence. Here, we investigate the roots of orphan gene emergence in the Drosophila genus. Across the annotated proteomes of twelve species, we find 6297 orphan genes within 4953 taxon-specific clusters of orthologs. By inferring the ancestral DNA to be non-coding for between 550 and 2467 (8.7–39.2%) of these genes, we describe for the first time how de novo emergence contributes to the abundance of clade-specific Drosophila genes. Consistent with functional roles, de novo genes show robust expression and evidence of translation. However, the distinct nucleotide sequences of de novo genes, whose characteristics are intermediate between those of intergenic regions and conserved genes, reflect their recent birth from non-coding DNA. We find that de novo genes encode more disordered proteins than both older genes and intergenic regions. Together, our results suggest that gene emergence from non-coding DNA provides an abundant source of material for the evolution of new proteins. Following gene birth, gradual evolution over long evolutionary timescales moulds sequence properties towards those of conserved genes, resulting in a continuum of properties whose starting points depend on the nucleotide sequences of an initial pool of novel genes.
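The reported range of 8.7–39.2% follows directly from the counts given above. As a minimal sketch of that arithmetic (the variable and function names are illustrative and not taken from the study), the lower and upper bounds can be recomputed as fractions of the 6297 orphan genes:

```python
# Counts as stated in the abstract; names below are illustrative assumptions.
TOTAL_ORPHANS = 6297
DE_NOVO_LOW, DE_NOVO_HIGH = 550, 2467


def percent_of_orphans(count, total=TOTAL_ORPHANS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


low = percent_of_orphans(DE_NOVO_LOW)
high = percent_of_orphans(DE_NOVO_HIGH)
print(f"{low}-{high}% of orphan genes inferred as de novo")
```

Both bounds use the same denominator, so the width of the range reflects only how strictly the ancestral sequence is required to be non-coding.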
