Genome puzzle master (GPM): an integrated pipeline for building and editing pseudomolecules from fragmented sequences

Abstract

Motivation: Next-generation sequencing technologies have revolutionized our ability to generate vast quantities of sequence data rapidly and affordably. Once generated, raw sequences are assembled into contigs or scaffolds. However, these assemblies are mostly fragmented and inaccurate at the whole-genome scale, largely because additional informative datasets (e.g. physical, optical and genetic maps) cannot be integrated. To address this problem, we developed a semi-automated software tool, Genome Puzzle Master (GPM), that enables the integration of additional genomic signposts to edit and build ‘new-gen-assemblies’ that result in high-quality ‘annotation-ready’ pseudomolecules.

Results: With GPM, loaded datasets can be connected to each other via their logical relationships, which makes it possible to ‘group’, ‘merge’, and ‘order and orient’ sequences in a draft assembly. Manual editing can also be performed through a user-friendly graphical interface. Final pseudomolecules reflect a user’s total data package and are available for long-term project management. GPM is a web-based pipeline and an important part of a Laboratory Information Management System (LIMS), which can be easily deployed on local servers for any genome research laboratory.

Availability and Implementation: The GPM (with LIMS) package is available at https://github.com/Jianwei-Zhang/LIMS

Contacts: jzhang@mail.hzau.edu.cn or rwing@mail.arizona.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
