A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes

Background: The challenges of accurate gene prediction and enumeration are aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation based on counting occurrences of k-mers has previously been used to distinguish TEs from low-copy genic regions, but currently available software is impractical because of high memory requirements or specialization for specific user tasks.

Results: Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays, which allows much greater flexibility in the choice of k-mer size. Tallymer can process large data sets of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10^9 bp). We analyzed k-mer frequencies for a wide range of k. At this low genome coverage (≈ 0.45×), highly repetitive 20-mers constituted 44% of the genome but represented only 1% of all possible k-mers. Similarly low complexity was seen in the repeat fractions of sorghum and rice. When our method was applied to other maize data sets, High-C0t derived sequences showed the greatest enrichment for low-copy sequences. Among annotated TEs, the most highly repetitive were of the Ty3/gypsy class of retrotransposons, followed by the Ty1/copia class and DNA transposons. Among expressed sequence tags (ESTs), a notable fraction contained high-copy k-mers, suggesting that transposons are still active in maize. Retrotransposons in Mo17 and McC cultivars were readily detected using the B73 20-mer frequency index, indicating their conservation despite extensive rearrangement across cultivars. Among one hundred annotated bacterial artificial chromosomes (BACs), k-mer frequency could be used to detect transposon-encoded genes with 92% sensitivity, compared to 96% using alignment-based repeat masking, while both methods showed 92% specificity.

Conclusion: The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see http://www.zbh.uni-hamburg.de/Tallymer.
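
To illustrate why a suffix-array index leaves the choice of k open, the following is a minimal Python sketch, not Tallymer's actual implementation. It builds a suffix array and a longest-common-prefix (LCP) table once, then derives k-mer counts for any k from the same index. The function names (`suffix_array`, `lcp_array`, `kmer_counts`) are hypothetical, the suffix-array construction is deliberately naive, and only the forward strand of a toy sequence is considered.

```python
def suffix_array(s):
    # Naive construction by sorting all suffixes; suitable only for toy inputs.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # Kasai-style linear-time LCP computation:
    # lcp[i] is the length of the common prefix of suffixes sa[i-1] and sa[i].
    n = len(s)
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] * n
    h = 0
    for p in range(n):
        if rank[p] > 0:
            q = sa[rank[p] - 1]
            while p + h < n and q + h < n and s[p + h] == s[q + h]:
                h += 1
            lcp[rank[p]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def kmer_counts(s, sa, lcp, k):
    # Suffixes sharing the same first k characters are adjacent in the
    # suffix array, so each k-mer corresponds to a run with lcp >= k.
    counts = {}
    current = None
    for i, p in enumerate(sa):
        if p + k > len(s):
            current = None  # suffix shorter than k holds no k-mer
            continue
        if current is not None and lcp[i] >= k:
            counts[current] += 1
        else:
            current = s[p:p + k]
            counts[current] = 1
    return counts


# Build the index once, then query several values of k.
seq = "ACGTACGTAC"
sa = suffix_array(seq)
lcp = lcp_array(seq, sa)
counts_k2 = kmer_counts(seq, sa, lcp, 2)
counts_k4 = kmer_counts(seq, sa, lcp, 4)
```

The index (`sa`, `lcp`) is independent of k, so scanning it with a different k requires no rebuild. A hash-table counter, by contrast, is typically constructed for one fixed k.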
