Topic modeling for OLAP on multidimensional text databases: topic cube and its applications

As the amount of textual information grows explosively in various kinds of business systems, it becomes more and more desirable to analyze both structured data records and unstructured text data simultaneously. Although online analytical processing (OLAP) techniques have been proven very useful for analyzing and mining structured data, they face challenges in handling text data. On the other hand, probabilistic topic models are among the most effective approaches to latent topic analysis and mining on text data. In this paper, we study a new data model called topic cube to combine OLAP with probabilistic topic modeling and enable OLAP on the dimension of text data in a multidimensional text database. Topic cube extends the traditional data cube to cope with a topic hierarchy and stores probabilistic content measures of text documents learned through a probabilistic topic model. To materialize topic cubes efficiently, we propose two heuristic aggregations to speed up the iterative Expectation-Maximization (EM) algorithm for estimating topic models by leveraging the models learned on component data cells to choose a good starting point for iteration. Experimental results show that these heuristic aggregations are much faster than the baseline method of computing each topic cube from scratch. We also discuss some potential uses of topic cube and show sample experimental results. Copyright © 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 2: 378-395, 2009

[1]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[2]  R. A. Leibler,et al.  On Information and Sufficiency , 1951 .

[3]  Mukesh K. Mohania,et al.  Efficiently linking text documents with relevant structured information , 2006, VLDB.

[4]  Tao Tao,et al.  Regularized estimation of mixture models for robust pseudo-relevance feedback , 2006, SIGIR.

[5]  Surajit Chaudhuri,et al.  An overview of data warehousing and OLAP technology , 1997, SGMD.

[6]  Richard Sproat,et al.  Mining correlated bursty topic patterns from coordinated text streams , 2007, KDD '07.

[7]  Jeffrey F. Naughton,et al.  On the Computation of Multidimensional Aggregates , 1996, VLDB.

[8]  John D. Lafferty,et al.  Correlated Topic Models , 2005, NIPS.

[9]  ChengXiang Zhai,et al.  Automatic labeling of multinomial topic models , 2007, KDD '07.

[10]  ChengXiang Zhai,et al.  Generating Impact-Based Summaries for Scientific Literature , 2008, ACL.

[11]  E. G. Lyman,et al.  NASA aviation safety reporting system , 1976 .

[12]  Thomas L. Griffiths,et al.  Hierarchical Topic Models and the Nested Chinese Restaurant Process , 2003, NIPS.

[13]  Hamid Pirahesh,et al.  Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals , 1996, Data Mining and Knowledge Discovery.

[14]  Jian Pei,et al.  Ix-cubes: iceberg cubes for data warehousing and olap on xml data , 2007, CIKM '07.

[15]  Yi Lin,et al.  Prediction Cubes , 2005, VLDB.

[16]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[17]  Koichi Takeda,et al.  A method for online analytical processing of text data , 2007, CIKM '07.

[18]  Thomas L. Griffiths,et al.  Probabilistic author-topic models for information discovery , 2004, KDD.

[19]  Thomas Hofmann,et al.  The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data , 1999, IJCAI.

[20]  Jignesh M. Patel,et al.  Efficient aggregation for graph summarization , 2008, SIGMOD Conference.

[21]  Yixin Chen,et al.  Multi-Dimensional Regression Analysis of Time-Series Data Streams , 2002, VLDB.

[22]  Nick Koudas,et al.  BlogScope: A System for Online Analysis of High Volume Text Streams , 2007, VLDB.

[23]  Bo Zhao,et al.  Text Cube: Computing IR Measures for Multidimensional Text Database Analysis , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[24]  David Wai-Lok Cheung,et al.  OLAP on sequence data , 2008, SIGMOD Conference.

[25]  W. Scott Spangler,et al.  The integration of business intelligence and knowledge management , 2002, IBM Syst. J..

[26]  Torben Bach Pedersen,et al.  A relevance-extended multi-dimensional model for a data warehouse contextualized with documents , 2005, DOLAP '05.

[27]  Berthold Reinwald,et al.  Towards keyword-driven analytical processing , 2007, SIGMOD '07.

[28]  John D. Lafferty,et al.  Model-based feedback in the language modeling approach to information retrieval , 2001, CIKM '01.

[29]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[30]  Luis Gravano,et al.  Modeling and managing changes in text databases , 2007, TODS.

[31]  Jiawei Han,et al.  Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases , 2009, SDM.

[32]  Xu Ling,et al.  Topic sentiment mixture: modeling facets and opinions in weblogs , 2007, WWW '07.

[33]  Berthold Reinwald,et al.  Multidimensional content eXploration , 2008, Proc. VLDB Endow..

[34]  Chao Liu,et al.  A probabilistic approach to spatiotemporal theme pattern mining on weblogs , 2006, WWW '06.