Into the heart of darkness: large-scale clustering of human non-coding DNA

MOTIVATION: It is currently believed that the human genome contains about twice as much functional non-coding sequence as protein-coding sequence, yet our understanding of these non-coding regions is very limited.

RESULTS: We examine the intersection between syntenically conserved sequences in the human, mouse and rat genomes and sequence similarities within the human genome itself, in search of families of non-protein-coding elements. For this purpose we develop a graph-theoretic clustering algorithm, akin to the highly successful methods used to elucidate protein sequence family relationships. The algorithm is applied to a highly filtered set of about 700 000 human-rodent evolutionarily conserved regions that do not resemble any known coding sequence and that together encompass 3.7% of the human genome. From these, we obtain roughly 12 000 non-singleton clusters, dense in significant sequence similarities. Further analysis of genomic location, evidence of transcription and RNA secondary structure reveals many clusters to be significantly homogeneous in one or more of these characteristics. This subset of the highly conserved non-protein-coding elements in the human genome thus contains rich family-like structures, which merit in-depth analysis.

AVAILABILITY: Supplementary material to this work is available at http://www.soe.ucsc.edu/~jill/dark.html
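The abstract does not spell out the clustering algorithm, so the following is only a minimal sketch of the general idea of graph-based sequence clustering, not the authors' method. It assumes conserved elements are graph nodes and significant pairwise similarities are undirected edges, groups nodes by connected components (a deliberately simple stand-in), and reports how dense each cluster is in similarity edges. The names `cluster_by_similarity` and `edge_density`, and the element labels, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_similarity(elements, hits):
    """Return non-singleton connected components of the similarity graph.

    elements: iterable of element identifiers (graph nodes).
    hits: iterable of (a, b) pairs with a significant similarity (edges).
    Assumption: connected components stand in for a real clustering method.
    """
    parent = {e: e for e in elements}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for e in parent:
        groups[find(e)].add(e)
    # Keep only clusters with more than one member.
    return [g for g in groups.values() if len(g) > 1]


def edge_density(cluster, hits):
    """Fraction of member pairs in the cluster that are linked by a hit."""
    hit_set = {frozenset(h) for h in hits}
    pairs = list(combinations(sorted(cluster), 2))
    present = sum(1 for p in pairs if frozenset(p) in hit_set)
    return present / len(pairs)


# Toy input: six hypothetical elements and four similarity hits.
elements = ["e1", "e2", "e3", "e4", "e5", "e6"]
hits = [("e1", "e2"), ("e2", "e3"), ("e1", "e3"), ("e4", "e5")]
clusters = cluster_by_similarity(elements, hits)
densities = {tuple(sorted(c)): edge_density(c, hits) for c in clusters}
```

Connected components alone can chain unrelated elements together through a single spurious hit, which is why a density measure (or a stricter partitioning step) is useful when judging whether a cluster looks like a family.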
