A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events

Motivation: Genome-wide association study (GWAS) methods applied to bacterial genomes have shown promising results for discovering genetic markers and for finely assessing marker effects. Recently, alignment-free methods based on k-mer composition have proven able to explore the accessory genome. However, they produce redundant descriptions, and their results are hard to interpret.

Methods: Here, we introduce DBGWAS, an extended k-mer-based GWAS method that produces interpretable genetic variants associated with phenotypes. Relying on compacted De Bruijn graphs (cDBG), our method gathers the cDBG nodes identified by the association model into subgraphs defined from their neighbourhood in the initial cDBG. DBGWAS is fast and alignment-free, and requires only a set of contigs and phenotypes. It produces annotated subgraphs representing local polymorphisms as well as mobile genetic elements (MGE), and offers a graphical framework for interpreting GWAS results.

Results: We validated our method using antibiotic resistance phenotypes for three bacterial species. DBGWAS recovered known resistance determinants, such as mutations in core genes in Mycobacterium tuberculosis and genes acquired by horizontal transfer in Staphylococcus aureus and Pseudomonas aeruginosa, along with their MGE context. It also enabled us to formulate new hypotheses involving genetic variants not yet described in the antibiotic resistance literature.

Conclusion: Our method proved efficient at retrieving any type of phenotype-associated genetic variant without prior knowledge. All experiments were computed in less than two hours and produced a compact set of meaningful subgraphs, thereby outperforming other GWAS approaches and facilitating the interpretation of the results.

Availability: Open-source tool available at https://gitlab.com/leoisl/dbgwas
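The subgraph-gathering step described under Methods can be illustrated with a minimal sketch. It is not the DBGWAS implementation: the toy graph, the node names (`u1` to `u8`), the `radius` parameter, and all function names are assumptions made for illustration. The graph is treated as an undirected adjacency map over already-compacted nodes, the "significant" nodes stand in for those selected by the association model, and each returned subgraph is a connected component of the neighbourhood around them.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def neighbourhood(graph, seeds, radius):
    """Return every node within `radius` edges of any seed (multi-source BFS)."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue  # do not expand beyond the chosen radius
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)


def connected_components(graph, nodes):
    """Split `nodes` into connected components of the induced subgraph."""
    remaining = set(nodes)
    components = []
    while remaining:
        start = remaining.pop()
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, ()):
                if nxt in remaining:
                    remaining.remove(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


def gather_subgraphs(graph, significant, radius=1):
    """Group significant nodes and their neighbourhood into sorted subgraphs."""
    kept = neighbourhood(graph, significant, radius)
    return sorted(sorted(c) for c in connected_components(graph, kept))


# Hypothetical toy graph: a simple path u1 - u2 - ... - u8.
toy_edges = [(f"u{i}", f"u{i + 1}") for i in range(1, 8)]
toy_graph = build_adjacency(toy_edges)

# Two distant significant nodes yield two separate subgraphs.
far_apart = gather_subgraphs(toy_graph, {"u2", "u7"}, radius=1)

# Two nearby significant nodes are merged into a single subgraph.
close_by = gather_subgraphs(toy_graph, {"u2", "u4"}, radius=1)

print(far_apart)
print(close_by)
```

In this sketch, significant nodes whose neighbourhoods touch end up in the same subgraph, which is one way to reduce redundancy among associated k-mer-derived nodes; the radius value and the merging rule used by the actual tool are not specified in the abstract.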
