A data-mining approach to spacer oligonucleotide typing of Mycobacterium tuberculosis

MOTIVATION: The Direct Repeat (DR) locus of Mycobacterium tuberculosis is a suitable model for studying (i) the molecular epidemiology and (ii) the evolutionary genetics of tuberculosis. The locus is analyzed with a DNA genotyping technique called spacer oligonucleotide typing (spoligotyping). In this paper, we investigated data analysis methods for discovering intelligible knowledge rules from spoligotyping data, an approach that had not previously been applied to this kind of representation. Decision trees were induced with the C4.5 algorithm, and knowledge rules were produced from them. A Prototype Selection (PS) procedure was then applied to eliminate noisy data, which simplified the decision rules and reduced the number of spacers that must be tested to solve classification tasks. In the second part of the paper, the contribution of 25 additional spacers, and of the knowledge rules inferred with them, was studied from a machine learning point of view. From a statistical point of view, the correlations between spacers were analyzed; they suggest that both negative and positive correlations may be related to potential structural constraints within the DR locus that may shape its evolution directly or indirectly.

RESULTS: By generating knowledge rules induced from decision trees, it was shown that expert knowledge can not only be modeled but also improved and simplified to solve automatic classification tasks on unknown patterns. A practical consequence of this study may be a simplification of the spoligotyping technique, reducing the experimental constraints and increasing the number of samples that can be processed.
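To make the tree-induction step more concrete, the following is a minimal sketch of the gain-ratio criterion that C4.5 uses to choose which attribute to split on, applied here to binary spacer patterns (1 = spacer present, 0 = absent). The toy patterns, the class labels "A" and "B", and the function names are assumptions made for illustration; they are not the data or code of the study, and a full C4.5 implementation also handles pruning and further refinements that are omitted here.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(patterns, labels, spacer):
    """Gain ratio obtained by splitting on one binary spacer (1 = present, 0 = absent)."""
    total = len(labels)
    # Group the class labels by the value of the chosen spacer.
    groups = {}
    for pattern, label in zip(patterns, labels):
        groups.setdefault(pattern[spacer], []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes splits that fragment the data.
    split_info = entropy([pattern[spacer] for pattern in patterns])
    return info_gain / split_info if split_info > 0 else 0.0


def best_spacer(patterns, labels):
    """Index of the spacer with the highest gain ratio (first one wins on ties)."""
    n_spacers = len(patterns[0])
    return max(range(n_spacers), key=lambda s: gain_ratio(patterns, labels, s))


# Hypothetical toy data: four isolates, four spacers, two invented classes.
PATTERNS = [
    (1, 1, 0, 1),
    (1, 1, 0, 0),
    (1, 0, 1, 1),
    (1, 0, 1, 0),
]
LABELS = ["A", "A", "B", "B"]

if __name__ == "__main__":
    for s in range(len(PATTERNS[0])):
        print(f"spacer {s}: gain ratio = {gain_ratio(PATTERNS, LABELS, s):.3f}")
    print("selected spacer:", best_spacer(PATTERNS, LABELS))
```

In this toy set, a spacer that is present in every isolate carries no information, whereas a spacer whose presence coincides with the class separates the isolates perfectly. This is the intuition behind reducing the number of spacers that need to be tested.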
