Global sequence characterization of rice centromeric satellite based on oligomer frequency analysis in large-scale sequencing data

MOTIVATION Satellite DNA makes up a significant portion of many eukaryotic genomes, yet it remains relatively poorly characterized even in extensively sequenced species. This is partly due to the limitations of traditional methods of satellite repeat analysis, which are based on multiple alignments of monomer sequences. We therefore employed an alternative, alignment-free approach based on k-mer frequency statistics, which is in principle better suited to analyzing large sets of satellite repeat data, including sequence reads from next-generation sequencing technologies.

RESULTS k-mer frequency spectra were determined for two sets of rice centromeric satellite CentO sequences: 454 reads from ChIP-sequencing of CENH3-bound DNA (7.6 Mb) and whole-genome Sanger sequencing reads (5.8 Mb). The k-mer frequencies were used to identify the most conserved sequence regions and to reconstruct consensus sequences of complete monomers. Both the reconstructed consensus sequences and the overall divergence of the k-mer spectra revealed high similarity between the two datasets, suggesting that CentO sequences associated with functional centromeres (CENH3-bound) do not differ significantly from the total population of CentO, which includes both centromeric and pericentromeric repeat arrays. In contrast, considerable differences were found when the same methods were used to compare CentO populations between individual chromosomes of the rice genome assembly, demonstrating preferential sequence homogenization of clusters within the same chromosome. k-mer frequencies were also successfully used to identify and characterize smRNAs derived from CentO repeats.
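The alignment-free idea summarized above can be illustrated with a minimal sketch. It counts overlapping k-mers in a set of reads, compares two normalized spectra, and rebuilds a monomer by greedy extension from the most frequent k-mer. All function names are hypothetical, and both the distance measure (half the L1 distance) and the greedy extension are illustrative assumptions; the abstract does not specify which divergence statistic or reconstruction procedure the authors used.

```python
from collections import Counter

DNA = "ACGT"


def kmer_counts(reads, k):
    """Count every overlapping k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing N or other ambiguity codes.
            if set(kmer) <= set(DNA):
                counts[kmer] += 1
    return counts


def kmer_frequencies(counts):
    """Convert raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def spectrum_distance(freq_a, freq_b):
    """Half the L1 distance between two spectra (assumed measure).

    0 means identical spectra, 1 means no shared k-mers.
    """
    kmers = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(x, 0.0) - freq_b.get(x, 0.0)) for x in kmers)


def greedy_consensus(counts, k, max_len=1000):
    """Greedily extend the most frequent k-mer one base at a time.

    At each step the base giving the most frequent next k-mer is appended.
    Extension stops when the seed k-mer reappears (the tandem repeat has
    closed on itself), when no observed k-mer continues the sequence, or
    when max_len is reached.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    if not counts:
        return ""
    seed = counts.most_common(1)[0][0]
    seq = seed
    while len(seq) < max_len:
        suffix = seq[-(k - 1):]
        best = max(DNA, key=lambda base: counts[suffix + base])
        if counts[suffix + best] == 0:
            break  # dead end: no observed k-mer extends the sequence
        seq += best
        if seq[-k:] == seed:
            return seq[:-k]  # drop the repeated seed to keep one monomer
    return seq


if __name__ == "__main__":
    # Toy data only: an 8 bp monomer repeated in tandem, not real CentO.
    toy_reads = ["ACGGTCAT" * 5]
    toy_counts = kmer_counts(toy_reads, k=4)
    print(greedy_consensus(toy_counts, k=4))
```

Greedy extension works in this toy case because every (k-1)-mer in the monomer is unique. With real satellite reads, sequence variants and sequencing errors create branching paths, so a practical version would need to weigh competing extensions rather than always taking the single most frequent one.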
