Probabilistic topic modeling for genomic data interpretation

Recently, the concept of a species containing both core and distributed genes, known as the supra- or pangenome theory, has been introduced. In this paper, we develop a new method that analyzes the genome-level composition of DNA sequences in order to characterize a set of common genomic features shared by strains of the same species and to identify their functional roles. To this end, we first apply a composition-based approach that breaks DNA sequences into short sub-reads called ‘N-mers’ and represents each sequence by its N-mer frequencies. We then introduce the Latent Dirichlet Allocation (LDA) model to study the genome-level statistical patterns (a.k.a. latent topics) of the N-mer features. Each estimated latent topic represents a certain component of the whole genome. Using the BioJava toolkit, we access gene region information for reference sequences from the NCBI database. We use our data mining framework to investigate two questions: 1) do strains within a species share similar core and distributed topics? and 2) do genes with similar functional roles contain similar latent topics? After studying the mutual information between latent topics and gene regions, we provide examples of each, using the BioCyc database to link pathway and reaction information to the genes. The examples demonstrate the effectiveness of the proposed method.
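
The composition-based step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `nmer_counts` and `nmer_frequency_vector` are hypothetical, the default N = 3 is an assumption (the abstract does not state the value of N), and skipping windows that contain symbols other than A, C, G, T is an illustrative choice. The sketch counts overlapping N-mers in a sequence and normalizes the counts into a frequency vector over the full 4^N vocabulary, which is the kind of "bag of words" representation a topic model such as LDA takes as input.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def nmer_counts(sequence, n=3):
    """Count overlapping N-mers in a DNA sequence.

    Windows containing any symbol outside A, C, G, T (e.g. 'N')
    are skipped; this handling is an assumption of the sketch.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        nmer = seq[i:i + n]
        if set(nmer) <= set(ALPHABET):
            counts[nmer] += 1
    return counts


def nmer_frequency_vector(sequence, n=3):
    """Return (vocabulary, frequencies) over all 4**n possible N-mers.

    Frequencies sum to 1 when at least one valid N-mer is present,
    and are all 0.0 otherwise.
    """
    vocab = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = nmer_counts(sequence, n)
    total = sum(counts.values())
    freqs = [counts[w] / total if total else 0.0 for w in vocab]
    return vocab, freqs


if __name__ == "__main__":
    vocab, freqs = nmer_frequency_vector("ACGTACGTTAGC", n=2)
    # Print only the N-mers that actually occur in the toy sequence.
    for word, f in zip(vocab, freqs):
        if f > 0:
            print(word, round(f, 3))
```

In a topic-modeling setting, each sequence (or sequence fragment) would play the role of a document and each N-mer the role of a word; the raw counts from `nmer_counts`, rather than the normalized frequencies, are what a count-based LDA implementation would typically consume.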
