Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes.

Using an unsupervised pattern-discovery method, we processed the human intergenic and intronic regions and catalogued all variable-length patterns with identically conserved copies and multiplicities above what is expected by chance. Among the millions of discovered patterns, we found a subset of 127,998 patterns, termed pyknons, which have additional nonoverlapping instances in the untranslated and protein-coding regions of 30,675 transcripts from 20,059 human genes. The pyknons arrange combinatorially in the untranslated and coding regions of numerous human genes where they form mosaics. Consecutive instances of pyknons in these regions show a strong bias in their relative placement, favoring distances of approximately 22 nucleotides. We also found pyknons to be enriched in a statistically significant manner in genes involved in specific processes, e.g., cell communication, transcription, regulation of transcription, signaling, transport, etc. For approximately 1/3 of the pyknons, the intergenic/intronic instances of their reverse complement lie within 380,084 nonoverlapping regions, typically 60-80 nucleotides long, which are predicted to form double-stranded, energetically stable, hairpin-shaped RNA secondary structures; additionally, the pyknons subsume approximately 40% of the known microRNA sequences, thus suggesting a possible link with posttranscriptional gene silencing and RNA interference. Cross-genome comparisons reveal that many of the pyknons have instances in the 3' UTRs of genes from other vertebrates and invertebrates where they are overrepresented in similar biological processes, as in the human genome. These unexpected findings suggest potential unique functional connections between the coding and noncoding parts of the human genome.
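The placement bias described above, in which consecutive pyknon instances favor distances of approximately 22 nucleotides, can be illustrated with a minimal sketch. It is not the authors' pipeline. The function names, the toy patterns, and the toy sequence are hypothetical. Measuring distance from start position to start position, and resolving overlaps greedily from left to right, are also assumptions made for illustration.

```python
from collections import Counter

def find_nonoverlapping(sequence, patterns):
    """Locate exact pattern instances and keep a nonoverlapping subset.

    Overlaps are resolved greedily from left to right (an assumption,
    not necessarily the rule used in the original study).
    """
    hits = []
    for pattern in patterns:
        start = sequence.find(pattern)
        while start != -1:
            hits.append((start, start + len(pattern), pattern))
            start = sequence.find(pattern, start + 1)
    hits.sort()  # order by start position, then end position

    chosen = []
    last_end = 0
    for start, end, pattern in hits:
        if start >= last_end:  # skip anything overlapping the previous hit
            chosen.append((start, end, pattern))
            last_end = end
    return chosen

def consecutive_distances(hits):
    """Start-to-start distances between consecutive instances."""
    return [b[0] - a[0] for a, b in zip(hits, hits[1:])]

# Toy data: two made-up 8-nt patterns placed 22 nt apart in a made-up region.
pattern_a = "ACGTTGCA"
pattern_b = "GGATCCGG"
region = pattern_a + "T" * 14 + pattern_b + "T" * 14 + pattern_a

instances = find_nonoverlapping(region, [pattern_a, pattern_b])
distances = consecutive_distances(instances)
histogram = Counter(distances)
print(histogram)
```

On real transcripts, the histogram would be accumulated over all untranslated and coding regions and then inspected for a peak; the toy region here is built so that every consecutive pair is exactly 22 nucleotides apart.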
