Vocabulary size and its effect on topic representation

The impact of vocabulary reduction on topic modeling is explored for three data sets. Results are compared using four document- and topic-centered measures. Removal of singly occurring terms has minimal influence on topics. Removal of frequently occurring terms greatly influences topic outcomes for three measures.

This study investigates how computational overhead for topic model training may be reduced by selectively removing terms from the vocabulary of the text corpora being modeled. Using three datasets, we compare the impact of removing singly occurring terms; the top 0.5%, 1%, and 5% most frequently occurring terms; and both the top 0.5% most frequent terms and singly occurring terms together, along with changes in the number of topics modeled (10, 20, 30, 40, 50, 100). Four outcome measures are compared. The removal of singly occurring terms has little impact on outcomes for any of the measures tested. Document discriminative capacity, as measured by document space density, is reduced by the removal of frequently occurring terms but increases with higher numbers of topics. Vocabulary size does not greatly influence entropy, but entropy is affected by the number of topics. Finally, topic similarity, as measured by pairwise topic similarity and Jensen-Shannon divergence, decreases with the removal of frequent terms. The findings have implications for information science research in information retrieval and informetrics that makes use of topic modeling.
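The vocabulary reductions described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name prune_vocabulary and its parameters are hypothetical, and it rests on several assumptions: "singly occurring" is read as a total corpus frequency of 1, the "top x%" is taken over distinct vocabulary terms ranked by corpus frequency, and the cutoff count is rounded up.

```python
from collections import Counter
import math


def prune_vocabulary(docs, top_fraction=0.005, drop_singletons=True):
    """Remove rare and/or very frequent terms from tokenized documents.

    docs: list of token lists (one list per document).
    top_fraction: share of distinct terms, ranked by corpus frequency,
        to remove (0.005 corresponds to the top 0.5%). Assumption: the
        cutoff is rounded up to a whole number of terms.
    drop_singletons: if True, remove terms whose total corpus frequency
        is 1 (one possible reading of "singly occurring").
    Returns (pruned_docs, removed_terms).
    """
    # Corpus-wide term frequencies.
    counts = Counter(token for doc in docs for token in doc)

    removed = set()
    if drop_singletons:
        removed |= {term for term, freq in counts.items() if freq == 1}
    if top_fraction > 0:
        n_top = math.ceil(top_fraction * len(counts))
        removed |= {term for term, _ in counts.most_common(n_top)}

    # Filter each document, keeping token order intact.
    pruned = [[t for t in doc if t not in removed] for doc in docs]
    return pruned, removed


if __name__ == "__main__":
    toy_docs = [
        ["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "bird"],
    ]
    pruned_docs, removed_terms = prune_vocabulary(toy_docs, top_fraction=0.1)
    print(pruned_docs)
    print(sorted(removed_terms))
```

The pruned documents would then be passed to whichever topic model is being trained. Reading "singly occurring" by document frequency instead of corpus frequency would only require counting set(doc) rather than doc when building the counts.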
