Probabilistic topic modeling for the analysis and classification of genomic sequences

Background

Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies focus on the so-called barcode genes, which represent a well-defined region of the whole genome. Recently, alignment-free techniques have been gaining importance because they overcome the drawbacks of sequence alignment techniques. In this paper, a new alignment-free method for DNA sequence clustering and classification is proposed. The method is based on k-mer representation and text mining techniques.

Methods

The presented method is based on Probabilistic Topic Modeling, a statistical technique originally proposed for text documents. Probabilistic topic models find, in a document corpus, the topics (recurrent themes) that characterize classes of documents. Applied to DNA sequences, which play the role of documents, this technique exploits the frequency of fixed-length k-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences.

Results and conclusions

We classified over 7000 16S DNA barcode sequences taken from the Ribosomal Database Project (RDP) repository by training probabilistic topic models. The proposed method is compared to the RDP tool and to the Support Vector Machine (SVM) classification algorithm in an extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). For complete sequences, our method achieves results very similar to those of the RDP classifier and SVM. The most interesting results are obtained when short sequence snippets are considered. In these conditions the proposed method outperforms RDP and SVM on ultra-short sequences, and it exhibits a smooth decrease of performance, at every taxonomic level, as the sequence length decreases.
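The k-mer document representation that the method relies on can be illustrated with a minimal sketch. It assumes overlapping k-mers over the alphabet A, C, G, T and a toy value k = 3; the function names, the toy sequences and the choice of k are hypothetical and are not taken from the paper. Each sequence becomes one row of word counts, which is the kind of input a topic model such as LDA is typically fitted on.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping k-mers, skipping windows that contain non-ACGT symbols."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def document_term_matrix(sequences, k):
    """Build one count row per sequence over the full vocabulary of 4**k k-mers."""
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    rows = []
    for seq in sequences:
        counts = kmer_counts(seq, k)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Toy input (hypothetical sequences, not real 16S data).
toy_sequences = ["ACGTACGT", "TTTTACGA"]
vocabulary, matrix = document_term_matrix(toy_sequences, k=3)
```

Here each sequence is treated as a document and each k-mer as a word, so the resulting matrix has the same shape as a document-term matrix in text mining. Fitting the topic model itself is left out of the sketch.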
