GenomeScope: Fast reference-free genome profiling from short reads

Summary: GenomeScope is an open‐source web tool that rapidly estimates the overall characteristics of a genome, including genome size, heterozygosity rate and repeat content, from unprocessed short reads. These features are essential for studying genome evolution and help in choosing parameters for downstream analysis. We demonstrate its accuracy on 324 simulated and 16 real datasets spanning a wide range of genome sizes, heterozygosity levels and error rates.

Availability and Implementation: http://genomescope.org, https://github.com/schatzlab/genomescope.git.

Contact: mschatz@jhu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
