A heuristic approach to determine an appropriate number of topics in topic modeling

Background

Topic modeling is an active research field in machine learning. While mainly used to build models from unstructured textual data, it also offers an effective means of data mining in which samples represent documents and different biological endpoints or omics data represent words. Latent Dirichlet Allocation (LDA) is the most commonly used topic modeling method across a wide range of technical fields. However, model development can be arduous and tedious, requiring burdensome and systematic sensitivity studies to find the best set of model parameters, and time-consuming subjective evaluations are often needed to compare models. To date, research has yielded no simple way to choose the proper number of topics in a model short of a major iterative search.

Methods and results

Based on an analysis of how statistical perplexity varies during topic modeling, this study proposes a heuristic approach to estimate the most appropriate number of topics. Specifically, the rate of perplexity change (RPC) as a function of the number of topics is proposed as a suitable selector. We test the stability and effectiveness of the proposed method on three markedly different types of ground-truth datasets: Salmonella next-generation sequencing data, pharmacological side effects, and textual abstracts on computational biology and bioinformatics (TCBB) from PubMed.

Conclusion

The proposed RPC-based method is shown to choose the best number of topics in three numerical experiments covering widely different data types and databases of very different sizes. The work required was markedly less arduous than a full systematic sensitivity study with the number of topics as a parameter. We note that additional investigation is needed to substantiate the method's theoretical basis and to establish its generalizability in terms of dataset characteristics.

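To make the RPC idea concrete, the following is a minimal Python sketch, not the authors' released code, assuming RPC is the absolute change in held-out perplexity per unit change in the number of topics, and assuming the selected candidate is the first point at which RPC stops decreasing; the exact selection rule and perplexity computation are described in the paper, and the function names and numbers below are illustrative only.

```python
# Illustrative sketch of a rate-of-perplexity-change (RPC) heuristic.
# Assumption: RPC(i) = |(P_i - P_{i-1}) / (t_i - t_{i-1})|, where t_i is the
# i-th candidate number of topics and P_i the corresponding held-out perplexity.

def rate_of_perplexity_change(topic_counts, perplexities):
    """Compute RPC between consecutive candidate topic numbers."""
    rpc = []
    for i in range(1, len(topic_counts)):
        dp = perplexities[i] - perplexities[i - 1]
        dt = topic_counts[i] - topic_counts[i - 1]
        rpc.append(abs(dp / dt))
    return rpc

def select_topic_number(topic_counts, perplexities):
    """Return the candidate topic count at which RPC first stops decreasing.

    Assumed selection rule: scan RPC values in order and stop at the first
    rise; if RPC keeps decreasing, fall back to the largest candidate tried.
    """
    rpc = rate_of_perplexity_change(topic_counts, perplexities)
    for i in range(1, len(rpc)):
        if rpc[i] > rpc[i - 1]:        # RPC stops decreasing here
            return topic_counts[i]     # shared endpoint of the two intervals
    return topic_counts[-1]            # fallback: no change point found

# Hypothetical usage with made-up perplexity values from an LDA sweep:
candidates = [10, 20, 30, 40, 50, 60]
perplexity = [2500.0, 2100.0, 1950.0, 1920.0, 1860.0, 1850.0]
print(select_topic_number(candidates, perplexity))  # -> 40 for these values
```

In practice, the perplexity values would come from fitting an LDA model at each candidate number of topics and evaluating it on held-out documents; the lists above stand in for that sweep.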