Defining functional intergenic transcribed regions based on heterogeneous features of phenotype genes and pseudogenes

With advances in transcript profiling, the presence of transcriptional activity in intergenic regions has been well established in multiple model systems. However, whether intergenic expression reflects transcriptional noise or the activity of novel genes remains unclear. We identified intergenic transcribed regions (ITRs) in 15 diverse flowering plant species and found that the amount of intergenic expression correlates with genome size, a pattern that would be expected if intergenic expression is largely non-functional. To further assess the functionality of ITRs, we first used Arabidopsis thaliana as a model to build machine learning classifiers that accurately distinguish functional sequences (phenotype genes) from non-functional ones (pseudogenes and random unexpressed intergenic regions) by integrating 93 biochemical, evolutionary, and sequence-structure features. Next, by applying the models to ITRs, we found that 2,453 (21%) had features significantly similar to those of phenotype genes and thus were likely parts of functional genes, while an additional 17% resembled benchmark RNA genes. However, ~60% of ITRs were more similar to non-functional sequences and should be considered transcriptional noise unless experiments show otherwise. The predictive framework established here provides not only a comprehensive look at how functional, genic sequences are distinct from likely non-functional ones, but also a new way to differentiate novel genes from genomic regions with noisy transcriptional activity.
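The two-step idea described above (train on benchmark functional and non-functional sequences, then score ITRs) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, only three made-up features stand in for the 93 used in the study, plain logistic regression stands in for the unspecified machine learning classifiers, and the 0.5 score cutoff is arbitrary. It is not the authors' pipeline.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.1, epochs=200):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def score(model, x):
    """Return a functional-likelihood score in [0, 1] for one feature vector."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def make_examples(rng, n, mean):
    # Hypothetical data: 3 synthetic features per sequence, drawn around `mean`.
    return [[rng.gauss(mean, 1.0) for _ in range(3)] for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Step 1: benchmark sets (synthetic stand-ins for phenotype genes and pseudogenes).
functional = make_examples(rng, 100, 1.5)
nonfunctional = make_examples(rng, 100, -1.5)
X = functional + nonfunctional
y = [1] * len(functional) + [0] * len(nonfunctional)

model = train_logistic(X, y)
train_accuracy = sum(
    (score(model, x) >= 0.5) == bool(label) for x, label in zip(X, y)
) / len(X)

# Step 2: score unlabeled regions (synthetic stand-ins for ITRs).
itrs = make_examples(rng, 50, -0.5)
itr_scores = [score(model, x) for x in itrs]
likely_functional = sum(s >= 0.5 for s in itr_scores)  # 0.5 cutoff is an assumption

print(f"training accuracy: {train_accuracy:.2f}")
print(f"ITRs scored as likely functional: {likely_functional} of {len(itrs)}")
```

In practice the cutoff would be chosen from the score distributions of the benchmark sets rather than fixed in advance, and performance would be measured on held-out data rather than on the training examples.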
