Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement

Advances in modern sequencing technologies allow us to generate sufficient data to analyze hundreds of bacterial genomes from a single machine in a single day. This potential for sequencing massive numbers of genomes calls for fully automated methods to produce high-quality assemblies and variant calls. We introduce Pilon, a fully automated, all-in-one tool for correcting draft assemblies and calling sequence variants of multiple sizes, including very large insertions and deletions. Pilon works with many types of sequence data, but is particularly strong when supplied with paired-end data from two Illumina libraries, one with small inserts (e.g., 180 bp) and one with large inserts (e.g., 3–5 kb). Pilon significantly improves draft genome assemblies by correcting bases, fixing misassemblies, and filling gaps. For both haploid and diploid genomes, Pilon produces more contiguous genomes with fewer errors, enabling identification of more biologically relevant genes. Furthermore, Pilon identifies small variants with high accuracy compared with state-of-the-art tools, and is unique in its ability to accurately identify large sequence variants, including duplications, and to resolve large insertions. Pilon is being used to improve the assemblies of thousands of new genomes and to identify variants from thousands of clinically relevant bacterial strains. Pilon is freely available as open source software.
