Doc2Cube: Automated Document Allocation to Text Cube via Dimension-Aware Joint Embedding

The data cube is a cornerstone architecture for multidimensional analysis of structured datasets. It is highly desirable to conduct multidimensional analysis on text corpora with cube structures for various text-intensive applications in healthcare, business intelligence, and social media analysis. However, one bottleneck in constructing a text cube is automatically placing millions of documents into the right cube cells so that quality multidimensional analysis can be conducted afterwards; it is too expensive to allocate documents manually or to rely on massive amounts of labeled data. We propose Doc2Cube, a method that constructs a text cube from a given text corpus in an unsupervised way. Initially, only the label names (e.g., USA, China) of each dimension (e.g., location) are provided, without any labeled documents. Doc2Cube leverages label names as weak supervision signals and iteratively performs joint embedding of labels, terms, and documents to uncover their semantic similarities. To generate joint embeddings that are discriminative for cube construction, Doc2Cube learns dimension-tailored document representations by selectively focusing on terms that are highly label-indicative in each dimension. Furthermore, Doc2Cube alleviates label sparsity by propagating information from label names to other terms, thereby enriching the labeled term set. Our experiments on a real news corpus demonstrate that Doc2Cube significantly outperforms existing methods. Doc2Cube has been transferred to the U.S. Army Research Lab and is a core component of the EventCube system, which is being deployed for multidimensional news and social media data analysis.
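The two key ideas of the abstract, propagating label names to related terms and building dimension-tailored document representations from label-indicative terms, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes pre-trained term vectors (`term_vecs`) and seed label names per dimension (`label_seeds`) are given, treats those vectors as fixed rather than re-learning them iteratively, and uses simple cosine and entropy heuristics that stand in for the paper's actual scoring functions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def expand_labels(label_seeds, term_vecs, k=5):
    """Label expansion (illustrative): add each label's k nearest terms
    to its seed set to alleviate label sparsity. Seed terms are assumed
    to appear in term_vecs."""
    expanded = {}
    for label, seeds in label_seeds.items():
        centroid = np.mean([term_vecs[t] for t in seeds], axis=0)
        ranked = sorted(term_vecs, key=lambda t: cosine(term_vecs[t], centroid),
                        reverse=True)
        expanded[label] = set(seeds) | set(ranked[:k])
    return expanded

def focal_doc_vector(doc_terms, term_vecs, label_vecs):
    """Dimension-tailored document embedding (illustrative): weight each term
    by how concentrated its similarity is on a single label of this dimension,
    so label-indicative terms dominate the document representation."""
    vecs, weights = [], []
    for t in doc_terms:
        if t not in term_vecs:
            continue
        sims = np.array([cosine(term_vecs[t], lv) for lv in label_vecs.values()])
        p = np.exp(sims) / np.exp(sims).sum()
        # 1 - normalized entropy: high when the term points to one label only
        w = 1.0 if len(p) < 2 else 1.0 - (-(p * np.log(p)).sum() / np.log(len(p)))
        vecs.append(term_vecs[t])
        weights.append(w)
    if not vecs:
        return None
    w = np.array(weights)[:, None]
    return (w * np.array(vecs)).sum(axis=0) / (w.sum() + 1e-12)

def assign(doc_terms, term_vecs, label_seeds):
    """Assign one document to the best label of a single dimension."""
    expanded = expand_labels(label_seeds, term_vecs)
    label_vecs = {l: np.mean([term_vecs[t] for t in ts if t in term_vecs], axis=0)
                  for l, ts in expanded.items()}
    doc_vec = focal_doc_vector(doc_terms, term_vecs, label_vecs)
    if doc_vec is None:
        return None
    return max(label_vecs, key=lambda l: cosine(doc_vec, label_vecs[l]))
```

In the full method, document allocation would alternate with re-estimating the joint label-term-document embedding; the sketch only shows a single assignment pass over fixed term vectors for one dimension.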
