Bacteria classification has been deeply investigated with different tools for many purposes, such as early diagnosis, metagenomics, and phylogenetics. Classification methods based on ribosomal DNA sequences are considered a reference in this area. We present a new classifier for bacteria species based on a dissimilarity measure of purely combinatorial nature. This measure is based on the notion of Minimal Absent Words, a combinatorial definition that has recently found applications in bioinformatics. We incorporate this measure into a probabilistic neural network in order to classify bacteria species. Our approach is motivated by the vast literature on the combinatorics of Minimal Absent Words in relation to the degree of repetitiveness of a sequence. We ran our experiments on a public dataset of 16S ribosomal RNA sequences. Our approach achieved very high classification accuracy, showing that our method is comparable with the standard tools available for the automatic classification of bacteria species.
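To make the central notion concrete: a word is a minimal absent word of a sequence if it does not occur in the sequence while all of its proper factors do. The following is a minimal brute-force sketch of that definition, not the implementation used in this work. The function names are hypothetical, and the Jaccard distance between the two word sets is only an illustrative stand-in for the dissimilarity measure, which the abstract does not specify.

```python
def factors(x, max_len):
    """Return the set of all non-empty factors (substrings) of x up to max_len."""
    return {x[i:i + k]
            for k in range(1, max_len + 1)
            for i in range(len(x) - k + 1)}


def minimal_absent_words(x, alphabet="ACGT", max_len=None):
    """Brute-force minimal absent words of x (quadratic space; for illustration only).

    A word w = a + u + b is a minimal absent word if w is not a factor of x
    but both a + u and u + b are. A single letter missing from x also counts.
    """
    if max_len is None:
        max_len = len(x) + 1  # no minimal absent word can be longer than this
    facts = factors(x, max_len)
    # Letters of the alphabet that never occur in x
    maws = {a for a in alphabet if a not in facts}
    for left in facts:
        if len(left) + 1 > max_len:
            continue
        suffix = left[1:]  # u, possibly empty
        for b in alphabet:
            w = left + b
            if w not in facts and (suffix + b) in facts:
                maws.add(w)
    return maws


def maw_jaccard_distance(x, y, alphabet="ACGT"):
    """Illustrative stand-in dissimilarity: Jaccard distance between the MAW sets."""
    mx = minimal_absent_words(x, alphabet)
    my = minimal_absent_words(y, alphabet)
    union = mx | my
    if not union:
        return 0.0
    return 1.0 - len(mx & my) / len(union)
```

For example, over the alphabet {A, B} the sequence ABAAB has exactly four minimal absent words: AAA, AABA, BAB, and BB. Practical tools compute these sets in linear time with suffix-based data structures rather than by enumerating all factors as done here.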