Repetitive Elements May Comprise Over Two-Thirds of the Human Genome

Transposable elements (TEs) are conventionally identified in eukaryotic genomes by alignment to consensus element sequences. Using this approach, about half of the human genome has previously been identified as TEs and low-complexity repeats. We recently developed a highly sensitive alternative de novo strategy, P-clouds, that instead searches for clusters of high-abundance oligonucleotides that are related in sequence space (oligo “clouds”). We show here that P-clouds predicts >840 Mbp of additional repetitive sequences in the human genome, suggesting that 66%–69% of the human genome is repetitive or repeat-derived. To investigate this remarkable difference, we conducted detailed analyses of the ability of both P-clouds and a commonly used conventional approach, RepeatMasker (RM), to detect fragments of different sizes from the highly abundant human Alu and MIR SINEs. RM can have surprisingly low sensitivity for even moderately long fragments, in contrast to P-clouds, which has good sensitivity down to small fragment sizes (∼25 bp). Because short fragments have a high intrinsic probability of being false positives, we performed a probabilistic annotation that reflects this uncertainty. We further developed “element-specific” P-clouds (ESPs) to identify novel Alu and MIR SINEs, and using them we identified ∼100 Mbp of previously unannotated human elements. ESP estimates of new MIR sequences are in good agreement with RM-based predictions of the amount that RM missed. These results highlight the need for combined, probabilistic genome annotation approaches and suggest that the human genome consists of substantially more repetitive sequence than previously believed.
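
The idea of grouping high-abundance oligonucleotides that are close in sequence space can be illustrated with a minimal sketch. This is not the P-clouds implementation: the function names, the two thresholds (`core_min`, `member_min`), the single-substitution neighborhood, and the transitive expansion from every cloud member are all simplifying assumptions made here for illustration.

```python
from collections import Counter


def count_oligos(seq, k):
    """Count every overlapping oligo of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbors(oligo):
    """Yield all oligos that differ from `oligo` by exactly one substitution."""
    for i, base in enumerate(oligo):
        for alt in "ACGT":
            if alt != base:
                yield oligo[:i] + alt + oligo[i + 1:]


def build_clouds(counts, core_min, member_min):
    """Greedy clustering (assumed scheme, not the published algorithm).

    Seed a cloud from the most abundant unassigned oligo with
    count >= core_min, then repeatedly add one-substitution neighbors
    whose count is >= member_min.
    """
    assigned = {}   # oligo -> index of the cloud it belongs to
    clouds = []
    for oligo, n in counts.most_common():
        if n < core_min:
            break               # remaining oligos are too rare to seed a cloud
        if oligo in assigned:
            continue
        cloud_id = len(clouds)
        cloud = {oligo}
        assigned[oligo] = cloud_id
        frontier = [oligo]
        while frontier:
            current = frontier.pop()
            for nb in neighbors(current):
                if nb not in assigned and counts.get(nb, 0) >= member_min:
                    assigned[nb] = cloud_id
                    cloud.add(nb)
                    frontier.append(nb)
        clouds.append(cloud)
    return clouds, assigned


def annotate(seq, k, assigned):
    """Mark every position covered by at least one oligo that is in a cloud."""
    mask = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in assigned:
            for j in range(i, i + k):
                mask[j] = True
    return mask
```

On a real genome, `counts` would come from `count_oligos` over the whole assembly, and the thresholds would need to be chosen against the oligo counts expected by chance. A per-position boolean mask is also cruder than the probabilistic annotation the abstract describes, which weights calls by their false-positive risk.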
