Bayesian classifiers for detecting HGT using fixed and variable order Markov models of genomic signatures

MOTIVATION: Analyses of genomic signatures are gaining attention because they allow studies of species-specific relationships without requiring alignments of homologous sequences. A naïve Bayesian classifier has been built to discriminate between bacterial genomes by their composition of short oligomers, also known as DNA words. That classifier has proven successful in identifying foreign genes in Neisseria meningitidis. In this study we extend the classifier approach using either a fixed higher order Markov model (Mk) or a variable length Markov model (VLMk).

RESULTS: We propose a simple algorithm to lock a variable length Markov model to a certain number of parameters, and we show that the use of Markov models greatly increases the flexibility and accuracy of prediction compared with a naïve model. We also test the integrity of the classifiers in terms of false negatives and give estimates of the minimal sizes of training data. We end the report by proposing a method to reject a false hypothesis of horizontal gene transfer.

AVAILABILITY: Software and Supplementary information are available at www.cs.chalmers.se/~dalevi/genetic_sign_classifiers/.
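To make the fixed-order idea concrete, the following is a minimal sketch of a Bayesian classifier that scores a query sequence under one order-k Markov model per candidate genome and picks the highest-scoring one. It is not the authors' software: the function names (`train_markov`, `log_likelihood`, `classify`), the add-one pseudocount, the equal class priors, and the toy sequences are all assumptions made for illustration, and the variable length model (VLMk) is not covered.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"

def train_markov(seq, k, pseudocount=1.0):
    """Estimate log P(base | previous k bases) from one training sequence.

    The pseudocount is an illustrative smoothing choice, not taken from the paper.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(k, len(seq)):
        context, base = seq[i - k:i], seq[i]
        # Skip windows containing ambiguous symbols such as N.
        if base in ALPHABET and all(c in ALPHABET for c in context):
            counts[context][base] += 1
    model = {}
    for context in map("".join, product(ALPHABET, repeat=k)):
        total = sum(counts[context].values()) + pseudocount * len(ALPHABET)
        model[context] = {
            b: math.log((counts[context][b] + pseudocount) / total)
            for b in ALPHABET
        }
    return model

def log_likelihood(seq, model, k):
    """Sum the log transition probabilities of seq under an order-k model."""
    total = 0.0
    for i in range(k, len(seq)):
        context, base = seq[i - k:i], seq[i]
        if context in model and base in model[context]:
            total += model[context][base]
    return total

def classify(seq, models, k):
    """Return the best-scoring class and all scores, assuming equal priors."""
    scores = {name: log_likelihood(seq, m, k) for name, m in models.items()}
    return max(scores, key=scores.get), scores

# Toy training data standing in for two genomes (hypothetical sequences).
K = 2
genome_a = "ATATATTAATATTTAAATATAT" * 20
genome_b = "GCGCGGCCGCGCCGGCGCGC" * 20
models = {"A": train_markov(genome_a, K), "B": train_markov(genome_b, K)}

label, scores = classify("ATATTAAT", models, K)
print(label, scores)
```

Under this setup, a gene whose score is higher under a foreign genome's model than under its host's model would be a candidate for horizontal transfer. Setting `K = 0` reduces the sketch to a plain base-composition model, which gives a simple baseline for seeing what the added context contributes.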
