Finding maximal ranges with unique topics in a text database

Recent years have witnessed the rapid growth of text data, and with it the increasing importance of in-depth analysis of text data for various applications. Text data are often organized in a database in which documents are labeled with attributes such as time and location. Different documents manifest different topics. The topics of documents may change along these attributes, and such changes have been studied in prior work. However, previous analysis techniques, such as topic detection and tracking, topic lifetime, and burstiness, all focus on the topic behavior of the documents in a given attribute range without contrasting them with the documents in the overall range. This paper introduces the concept of unique topics: topics that appear frequently within a small range of documents but not in the whole range. Such unique topics may reflect characteristics of the documents in this small range that are not found outside it. The paper aims to provide an efficient pruning-based algorithm that, for a user-given set of keywords and a user-given attribute, finds the maximal ranges along the given attribute, together with their unique topics, that are highly related to the given keyword set. Thorough experiments show that the algorithm is effective in various scenarios.
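
To make the notion of a unique topic concrete, the following is a minimal brute-force sketch in Python, not the pruning-based algorithm of the paper. It assumes documents are already sorted by the chosen attribute and each document is represented by its set of topic labels. The function name `unique_topic_ranges` and the parameters `hi`, `lo`, and `min_len` are hypothetical, introduced only for illustration: a topic counts as unique to a range if it occurs in at least a fraction `hi` of the documents in that range but in at most a fraction `lo` of all documents. Keyword relatedness is not modeled here.

```python
from typing import List, Set, Tuple


def unique_topic_ranges(
    docs: List[Set[str]],
    hi: float = 0.8,   # assumed in-range frequency threshold (illustrative)
    lo: float = 0.4,   # assumed overall frequency threshold (illustrative)
    min_len: int = 2,  # assumed minimum range length (illustrative)
) -> List[Tuple[int, int, str]]:
    """Return maximal index ranges (start, end, topic), inclusive on both ends.

    `docs` must be ordered by the attribute of interest (e.g. time).
    This enumerates every range, so it is a naive illustration only.
    """
    n = len(docs)
    if n == 0:
        return []

    results: List[Tuple[int, int, str]] = []
    for topic in sorted(set().union(*docs)):
        # prefix[k] = number of the first k documents that contain the topic
        prefix = [0]
        for d in docs:
            prefix.append(prefix[-1] + (1 if topic in d else 0))

        # Skip topics that are frequent over the whole range.
        if prefix[n] / n > lo:
            continue

        # Collect every range in which the topic is frequent.
        qualifying = [
            (i, j)
            for i in range(n)
            for j in range(i + min_len - 1, n)
            if (prefix[j + 1] - prefix[i]) / (j - i + 1) >= hi
        ]

        # Keep only ranges not strictly contained in another qualifying range.
        for i, j in qualifying:
            contained = any(
                a <= i and j <= b and (a, b) != (i, j) for a, b in qualifying
            )
            if not contained:
                results.append((i, j, topic))
    return results


# Toy data: eight documents ordered by time; "election" is concentrated
# in documents 2..4, while "sports" is frequent everywhere.
docs = [
    {"sports"}, {"sports"}, {"election", "sports"}, {"election"},
    {"election"}, {"sports"}, {"sports"}, {"sports"},
]
print(unique_topic_ranges(docs))
```

On this toy data the only reported range covers documents 2 to 4 for the topic "election". The sketch enumerates all ranges for every topic, which is exactly the cost an efficient pruning-based approach would try to avoid.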
