PaPrBaG: A machine learning approach for the detection of novel pathogens from NGS data

The reliable detection of novel bacterial pathogens from next-generation sequencing data is a key challenge for microbial diagnostics. Current computational tools usually rely on sequence similarity and often fail to detect novel species when closely related genomes are missing from the reference database. Here we present PaPrBaG (Pathogenicity Prediction for Bacterial Genomes), a machine-learning-based approach. PaPrBaG overcomes genetic divergence by training on a wide range of species with a known pathogenicity phenotype. To that end, we compiled a comprehensive list of pathogenic and non-pathogenic bacteria with a human host, using various genome metadata in conjunction with a rule-based protocol. A detailed comparative study reveals that PaPrBaG has several advantages over sequence similarity approaches. Most importantly, it always provides a prediction, whereas other approaches discard a large number of sequencing reads with low similarity to currently known reference genomes. Furthermore, PaPrBaG remains reliable even at very low genomic coverages. Combining PaPrBaG with existing approaches further improves prediction results.
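The idea of scoring every read from learned sequence features, rather than discarding reads that have no close match in a database, can be illustrated with a small Python sketch. It is not the PaPrBaG implementation: the k-mer frequency features, the nearest-centroid classifier (a deliberately simple stand-in for the actual classifier), all function names, and the synthetic toy reads are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(read, k=2):
    """Relative frequency of every DNA k-mer in a read, in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def centroid(profiles):
    """Component-wise mean of a list of equally long profiles."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def distance(a, b):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train(reads_by_label, k=2):
    """Hypothetical training step: one mean k-mer profile per label."""
    return {
        label: centroid([kmer_profile(r, k) for r in reads])
        for label, reads in reads_by_label.items()
    }


def predict_read(model, read, k=2):
    """Every read gets a label: the one with the nearest centroid."""
    profile = kmer_profile(read, k)
    return min(model, key=lambda label: distance(model[label], profile))


def pathogenic_fraction(model, reads, k=2):
    """Aggregate read-level calls into one sample-level score."""
    calls = Counter(predict_read(model, r, k) for r in reads)
    return calls["pathogenic"] / len(reads)


# Synthetic toy reads (assumption): no biological meaning intended.
training_reads = {
    "pathogenic": ["ATATATATAT", "AATTAATTAA"],
    "non-pathogenic": ["GCGCGCGCGC", "GGCCGGCCGG"],
}
model = train(training_reads)
sample = ["TATATTAATA", "ATTATAATAT", "CGCGGCCGCG"]
fraction = pathogenic_fraction(model, sample)
```

Because the classifier compares composition profiles rather than looking for a database hit, no read is left unclassified, which mirrors the "always provides a prediction" property described above. How read-level scores are actually combined into a genome-level prediction in PaPrBaG is not stated here; the simple fraction is only one plausible choice.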
