Paper information - Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement

Advances in modern sequencing technologies allow us to generate sufficient data to analyze hundreds of bacterial genomes from a single machine in a single day. This potential for sequencing massive numbers of genomes calls for fully automated methods to produce high-quality assemblies and variant calls. We introduce Pilon, a fully automated, all-in-one tool for correcting draft assemblies and calling sequence variants of multiple sizes, including very large insertions and deletions. Pilon works with many types of sequence data, but is particularly strong when supplied with paired-end data from two Illumina libraries with small (e.g., 180 bp) and large (e.g., 3–5 Kb) inserts. Pilon significantly improves draft genome assemblies by correcting bases, fixing mis-assemblies, and filling gaps. For both haploid and diploid genomes, Pilon produces more contiguous genomes with fewer errors, enabling identification of more biologically relevant genes. Furthermore, Pilon identifies small variants with high accuracy compared to state-of-the-art tools and is unique in its ability to accurately identify large sequence variants, including duplications, and to resolve large insertions. Pilon is being used to improve the assemblies of thousands of new genomes and to identify variants from thousands of clinically relevant bacterial strains. Pilon is freely available as open source software.
