Improving Metagenomic Assemblies Through Data Partitioning: A GC Content Approach

Assembling metagenomic data sequenced by NGS platforms poses significant computational challenges, mainly because of large data volumes, sequencing errors, and variation in the size, complexity, diversity and abundance of the organisms present in a given metagenome. To address these problems, this work proposes an open-source bioinformatic tool called GCSplit, which partitions metagenomic sequences into subsets using a computationally inexpensive metric: the GC content. Experiments performed on real data show that preprocessing short reads with GCSplit prior to assembly reduces memory consumption and generates higher quality results, such as an increase in the N50 metric and a reduction in both the L50 value and the total number of contigs produced in the assembly. GCSplit is available at https://github.com/mirand863/gcsplit.
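To make the idea of GC-based partitioning concrete, the following is a minimal Python sketch and not GCSplit's actual implementation. It assumes reads are plain nucleotide strings, that the GC content of a read is the fraction of its bases that are G or C, and that partitions are formed by sorting reads on that value and cutting the sorted list into near-equal chunks. The function names `gc_content` and `partition_by_gc` and the example reads are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def gc_content(read: str) -> float:
    """Return the fraction of G and C bases in a read (0.0 for an empty read)."""
    if not read:
        return 0.0
    upper = read.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def partition_by_gc(reads: Sequence[str], parts: int) -> List[List[str]]:
    """Sort reads by GC content and cut the sorted list into `parts` chunks.

    Assumption: near-equal chunk sizes; the real tool may split differently.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    ordered = sorted(reads, key=gc_content)
    base, extra = divmod(len(ordered), parts)
    chunks: List[List[str]] = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one additional read each.
        size = base + (1 if i < extra else 0)
        chunks.append(ordered[start:start + size])
        start += size
    return chunks


# Toy input: four short reads with GC fractions 0, 1, 1/3 and 2/3.
reads = ["ATATAT", "GGCCGG", "ATGCAT", "GCGCAT"]
subsets = partition_by_gc(reads, 2)
```

In this sketch each subset would then be handed to an assembler separately, so reads with similar GC content are assembled together. Computing the metric needs only one pass over each read, which is what makes it inexpensive compared with composition-based binning.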

[1]  R. Knight,et al.  The Human Microbiome Project , 2007, Nature.

[2]  Sergey I. Nikolenko,et al.  SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing , 2012, J. Comput. Biol..

[3]  Tim H. Brom,et al.  A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data , 2012, 1203.4802.

[4]  Mark J. Bailey,et al.  TerraGenome: a consortium for the sequencing of a soil metagenome , 2009, Nature Reviews Microbiology.

[5]  O. White,et al.  Environmental Genome Shotgun Sequencing of the Sargasso Sea , 2004, Science.

[6]  Jens Roat Kultima,et al.  Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes , 2014, Nature Biotechnology.

[7]  Marcel H. Schulz,et al.  In silico read normalization using set multi-cover optimization , 2017, Bioinform..

[8]  J. Spear,et al.  Draft Genome of a Novel Chlorobi Member Assembled by Tetranucleotide Binning of a Hot Spring Metagenome , 2014, Genome Announcements.

[9]  F. Collart,et al.  Environment sensing and response mediated by ABC transporters , 2011, BMC Genomics.

[10]  R. Knight,et al.  The human microbiome project: exploring the microbial part of ourselves in a changing world , 2022 .

[11]  Colin N. Dewey,et al.  De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis , 2013, Nature Protocols.

[12]  Katherine H. Huang,et al.  Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning , 2015, Nature Biotechnology.

[13]  Huzefa Rangwala,et al.  MC-MinH: Metagenome Clustering using Minwise based Hashing , 2013, SDM.

[14]  Sallie W. Chisholm,et al.  Unlocking Short Read Sequencing for Metagenomics , 2010, PloS one.

[15]  F. Ibarbalz,et al.  Shotgun Metagenomic Profiles Have a High Capacity To Discriminate Samples of Activated Sludge According to Wastewater Type , 2016, Applied and Environmental Microbiology.

[16]  P. Pevzner,et al.  metaSPAdes: a new versatile metagenomic assembler. , 2017, Genome research.

[17]  P. Bork,et al.  A human gut microbial gene catalogue established by metagenomic sequencing , 2010, Nature.

[18]  Hideaki Tanaka,et al.  MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads , 2011, BCB '11.

[19]  Páll Melsted,et al.  KmerStream: Streaming algorithms for k-mer abundance estimation , 2014, bioRxiv.

[20]  Arend Hintze,et al.  Scaling metagenome sequence assembly with probabilistic de Bruijn graphs , 2011, Proceedings of the National Academy of Sciences.

[21]  Huzefa Rangwala,et al.  Evaluation of short read metagenomic assembly , 2010, 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

[22]  Alexey A. Gurevich,et al.  MetaQUAST: evaluation of metagenome assemblies , 2016, Bioinform..

[23]  S. Tringe,et al.  Tackling soil diversity with the assembly of large, complex metagenomes , 2014, Proceedings of the National Academy of Sciences.

[24]  Paul Medvedev,et al.  Parallel and Memory-Efficient Preprocessing for Metagenome Assembly , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[25]  Robert Nowak,et al.  Genomes correction and assembling: present methods and tools , 2014, Other Conferences.

[26]  Derrick E. Fouts,et al.  NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly , 2014, BMC Bioinformatics.