Identification of compositionally distinct regions in genomes using the centroid method

MOTIVATION: Most genomic regions of special interest, such as horizontally acquired sequences and genomic islands, are known to have distinct word (m-mer) compositions. Most earlier work in this direction addressed di- and tri-nucleotide compositions. We present a method, called the centroid approach, that can reveal compositionally distinct regions in genomic sequences for any given word size.

RESULTS: We applied our method to 50 bacterial genomes and demonstrated its ability to identify embedded sequences of varying lengths from distantly related organisms. For four organisms from our dataset, we also investigated the genetic makeup of the regions that our method identified as compositionally distinct. Pathogenicity island (PAI) components and genes encoding strain-specific proteins are frequently found in these regions.

AVAILABILITY: The program is available on request from the authors.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
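The abstract does not spell out the algorithm, so the following is only an illustrative sketch of how a centroid-style scan over m-mer compositions might look, not the authors' implementation. It assumes fixed-size sliding windows, Euclidean distance to the mean (centroid) frequency vector, and a z-score cutoff on those distances; the function names `kmer_frequencies`, `centroid_distances` and `flag_distinct`, and all parameter defaults, are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(seq, m):
    """Relative frequency of every m-mer over ACGT, in a fixed lexicographic order."""
    words = ["".join(p) for p in product("ACGT", repeat=m)]
    counts = Counter(seq[i:i + m] for i in range(len(seq) - m + 1))
    # Words containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def centroid_distances(genome, m, window, step):
    """Distance of each window's m-mer frequency vector from the centroid of all windows."""
    starts = list(range(0, len(genome) - window + 1, step))
    vectors = [kmer_frequencies(genome[s:s + window], m) for s in starts]
    dim = 4 ** m
    # Assumption: the centroid is the plain mean of the window vectors.
    centroid = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    # Assumption: Euclidean distance; the paper may use a different measure.
    dists = [sqrt(sum((a - b) ** 2 for a, b in zip(v, centroid))) for v in vectors]
    return starts, dists


def flag_distinct(genome, m=2, window=100, step=50, z_cutoff=2.0):
    """Start positions of windows whose centroid distance is unusually large."""
    starts, dists = centroid_distances(genome, m, window, step)
    mean = sum(dists) / len(dists)
    sd = sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    if sd == 0:
        return []
    # Assumption: a z-score cutoff decides what counts as "distinct".
    return [s for s, d in zip(starts, dists) if (d - mean) / sd > z_cutoff]


if __name__ == "__main__":
    # Toy genome: an AT-rich host with a 100 bp GC-rich insert at position 1000.
    toy = "AT" * 500 + "GC" * 50 + "AT" * 500
    print(flag_distinct(toy, m=2, window=100, step=50))
```

On the toy sequence, the windows overlapping the GC-rich insert are the ones farthest from the centroid. Because `m` is a free parameter, the same scan applies to any word size, although the vector dimension grows as 4^m and the window must be long enough to sample the words adequately.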
