Resolving the structural features of genomic islands: a machine learning approach.

Large inserts of horizontally acquired DNA that contain functionally related genes with limited phylogenetic distribution are often referred to as genomic islands (GIs), and structural definitions of these islands, based on common features, have been proposed. Although a large number of mobile elements fall well within the GI definition, there are several concerns about the structural consensus for GIs: (1) the current GI definition was put forward 10 years ago, when only 12 complete bacterial genomes were available; (2) a large number of GIs deviate from that definition; and (3) in silico predictions that assume a full or partial GI structural model bias the sampling of the GI structural space toward "well-structured" GIs. In this study, the structural features of genomic regions are sampled by a hypothesis-free, bottom-up search and then exploited in a machine learning approach, with the aim of explicitly quantifying and modeling the contribution of each feature to the GI structure. In a whole-genome comparative analysis of 37 strains from three different genera and 12 outgroup genomes, 668 genomic regions were sampled and used to train structural GI models. The data show that, overall, GIs from the three genera fall into distinct, genus-specific structural families. However, decreasing the taxonomic resolution by studying GI structures across genus boundaries yields models that converge on a fairly similar GI structure, suggesting that GIs are better seen as a superfamily of mobile elements with core and variable structural features than as a single well-defined family.
