Multi-Dimensional, Phrase-Based Summarization in Text Cubes

To systematically analyze large numbers of textual documents, it is often desirable to manage documents (and their metadata) in a multi-dimensional text database (Text Cube). Such a structure provides the flexibility to understand local information at different granularities. Moreover, the contextualized analysis derived from the cube structure often yields comparative insights. To quickly digest the content of subsets of documents in the multi-dimensional context, we study the problem of phrase-based summarization of a subset of documents of interest. We propose a new phrase ranking measure that leverages the relation between document subsets induced by the multi-dimensional context and identifies phrases that truly distinguish the queried subset of documents from neighboring subsets (i.e., the background). Our quality evaluation suggests that the new measure, which involves dynamic, query-dependent background generation, is more effective at finding representative phrases than previous measures that use the whole corpus as a static background. Computing this measure is more expensive, because answering one query requires access to many subsets of documents. We develop a cube-based analytical platform that implements an efficient solution by materializing a deliberately selected subset of statistics and using these statistics to perform online query processing within a constant latency constraint. Our experiments on a large news dataset demonstrate efficiency in both query processing time and storage cost.
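The idea of a query-dependent background can be illustrated with a toy example. The sketch below is not the ranking measure proposed in the paper, which the abstract does not specify; it assumes a simple stand-in score (the phrase's document frequency in the queried cell minus its average document frequency across sibling cells), and the dataset, dimension names, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical toy text cube: each document carries two dimensions
# ("topic", "year") and a list of already-extracted phrases.
DOCS = [
    {"topic": "economy", "year": 2014, "phrases": ["stock market", "interest rate", "white house"]},
    {"topic": "economy", "year": 2014, "phrases": ["interest rate", "federal reserve"]},
    {"topic": "economy", "year": 2013, "phrases": ["stock market", "debt ceiling"]},
    {"topic": "sports", "year": 2014, "phrases": ["world cup", "white house"]},
    {"topic": "sports", "year": 2014, "phrases": ["world cup", "stock market"]},
]


def select(docs, **query):
    """Return the documents in the cell identified by the query."""
    return [d for d in docs if all(d[k] == v for k, v in query.items())]


def doc_freq(docs):
    """Fraction of documents in which each phrase occurs."""
    counts = Counter(p for d in docs for p in set(d["phrases"]))
    return {p: c / len(docs) for p, c in counts.items()} if docs else {}


def sibling_cells(docs, query):
    """Non-empty cells that differ from the query in exactly one dimension value."""
    cells = []
    for dim, val in query.items():
        for other in sorted({d[dim] for d in docs} - {val}):
            subset = select(docs, **dict(query, **{dim: other}))
            if subset:
                cells.append(subset)
    return cells


def rank_phrases(docs, query):
    """Rank phrases by an assumed contrast score against sibling cells."""
    target = doc_freq(select(docs, **query))
    backgrounds = [doc_freq(c) for c in sibling_cells(docs, query)]
    scores = {}
    for phrase, freq in target.items():
        if backgrounds:
            bg = sum(b.get(phrase, 0.0) for b in backgrounds) / len(backgrounds)
        else:
            bg = 0.0
        scores[phrase] = freq - bg  # assumed score, not the paper's measure
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = rank_phrases(DOCS, {"topic": "economy", "year": 2014})
for phrase, score in ranked:
    print(f"{phrase}: {score:+.2f}")
```

In this toy data, "stock market" is frequent in the queried cell but even more frequent in its siblings, so it ranks last, while "interest rate" appears only in the queried cell and ranks first. Each query touches several neighboring cells, which hints at why the abstract describes the measure as more expensive to compute than one using a single static background.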
