Assessing host-specificity of Escherichia coli using a supervised learning logic-regression-based analysis of single nucleotide polymorphisms in intergenic regions.

Host specificity in E. coli is widely debated. Here, we used logic regression, a supervised learning method, to analyze intergenic DNA sequence variability in E. coli and to identify single nucleotide polymorphism (SNP) biomarkers associated with natural selection and evolution toward host specificity. Seven hundred and eighty strains of E. coli were isolated from 15 different animal hosts. Logic regression was applied to DNA sequence data from three intergenic regions (flanked by the genes uspC-flhDC, csgBAC-csgDEFG, and asnS-ompF) to identify genetic biomarkers that could potentially discriminate E. coli by host source. Across the 15 animal hosts, logic regression discriminated E. coli by host source with relatively high specificity (i.e., among samples from non-target hosts, the proportion that correctly lacked the host-specific marker pattern) and sensitivity (i.e., among samples from a given host, the proportion that correctly had the host-specific marker pattern), even after fivefold cross-validation. Permutation tests confirmed that, for most animals, the host-specific intergenic biomarkers identified by logic regression were significantly associated with animal host source. The highest biomarker sensitivity was observed in deer isolates: 82% of all deer E. coli isolates displayed a unique SNP pattern that was 98% specific to deer. Fifty-three percent of human isolates displayed a unique biomarker pattern that was 98% specific to humans, and 29% of cattle isolates displayed a unique biomarker that was 97% specific to cattle. Interestingly, highly specific SNP biomarkers were observed even within a related host group (Family Canidae: domestic dogs and coyotes), with 98% and 99% specificity for dogs and coyotes, respectively; 21% of dog E. coli isolates displayed a unique dog biomarker and 61% of coyote isolates displayed a unique coyote biomarker. Application of a supervised learning method, such as logic regression, to DNA sequence analysis at certain intergenic regions demonstrates that some E. coli strains may evolve to become host-specific.
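
The definitions of sensitivity and specificity given above, and the idea of a permutation test on host labels, can be illustrated with a minimal Python sketch. This is not the authors' pipeline and it does not implement logic regression: it assumes a marker pattern has already been chosen and simply scores it. The function names, the toy isolate counts, and the test statistic (sensitivity plus specificity) are all hypothetical choices made for illustration.

```python
import random


def sensitivity_specificity(has_marker, is_target):
    """Score one marker pattern against one target host.

    has_marker[i] is True if isolate i carries the marker pattern;
    is_target[i] is True if isolate i came from the target host.
    """
    pairs = list(zip(has_marker, is_target))
    tp = sum(1 for m, t in pairs if m and t)          # target, has marker
    fn = sum(1 for m, t in pairs if not m and t)      # target, lacks marker
    tn = sum(1 for m, t in pairs if not m and not t)  # non-target, lacks marker
    fp = sum(1 for m, t in pairs if m and not t)      # non-target, has marker
    return tp / (tp + fn), tn / (tn + fp)


def permutation_p_value(has_marker, is_target, n_perm=1000, seed=0):
    """Permutation test with the marker held fixed (a simplifying assumption).

    Host labels are shuffled n_perm times; the p-value is the share of
    shuffles whose sensitivity + specificity is at least the observed value.
    """
    rng = random.Random(seed)
    sens, spec = sensitivity_specificity(has_marker, is_target)
    observed = sens + spec
    labels = list(is_target)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        s, p = sensitivity_specificity(has_marker, labels)
        if s + p >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 10 target-host isolates (8 carry the marker)
# and 50 non-target isolates (1 carries the marker).
toy_marker = [True] * 8 + [False] * 2 + [True] * 1 + [False] * 49
toy_target = [True] * 10 + [False] * 50

toy_sens, toy_spec = sensitivity_specificity(toy_marker, toy_target)
toy_p = permutation_p_value(toy_marker, toy_target)
print(f"sensitivity={toy_sens:.2f} specificity={toy_spec:.2f} p={toy_p:.4f}")
```

In a fuller analysis the marker search itself would be repeated inside each permutation and each cross-validation fold, so that the reported sensitivity and specificity are not inflated by selecting the pattern on the same isolates used to score it.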
