CAMSA: a Tool for Comparative Analysis and Merging of Scaffold Assemblies

Motivation: Despite recent progress in genome sequencing and assembly, many of the currently available assembled genomes come in draft form. Such draft genomes consist of a large number of genomic fragments (scaffolds) whose positions and orientations along the genome are unknown. While a number of methods exist for reconstructing the genome from its scaffolds using various computational and wet-lab techniques, they often produce only partial, error-prone scaffold assemblies. It therefore becomes important to compare and merge scaffold assemblies produced by different methods, thus combining their advantages and highlighting conflicts for further investigation. These tasks can be labor-intensive if performed manually.

Results: We present CAMSA, a tool for comparative analysis and merging of two or more given scaffold assemblies. The tool (i) creates an extensive report with several comparative quality metrics; (ii) constructs the most confident merged scaffold assembly; and (iii) provides an interactive framework for visual comparative analysis of the given assemblies. Among the CAMSA features, only scaffold merging can be evaluated in comparison to existing methods. Namely, it resembles the functionality of assembly reconciliation tools, although their primary targets are somewhat different. Our evaluations show that CAMSA produces merged assemblies of comparable or better quality than existing assembly reconciliation tools while being the fastest in terms of total running time.

Availability: CAMSA is distributed under the MIT license and is available at http://cblab.org/camsa/.
