A Framework for Task-specific Short Document Expansion

Collections that contain a large number of short texts (e.g., tweets and reviews) are becoming increasingly common. Analytical tasks such as classification and clustering are challenging on short texts because the texts are sparse and carry little context, which often results in low accuracy on the task. A standard remedy is to expand short texts before subjecting them to the task. However, existing work on short text expansion suffers from certain limitations: (i) it depends on domain knowledge to expand the text; (ii) it employs task-specific heuristics; and (iii) the expansion procedure is tightly coupled to the task. This makes it hard to adapt a procedure designed for one task to another. We present an expansion technique, TIDE (Task-specIfic short Document Expansion), that can be applied to several Machine Learning, NLP and Information Retrieval tasks on short texts (such as short text classification, clustering, and entity disambiguation) without using task-specific heuristics or domain-specific knowledge for expansion. At the same time, our technique is capable of learning to expand short texts in a task-specific way: the same technique, applied to a short text in two different tasks, can learn to produce different expansions depending on which expansion benefits each task's performance. To speed up the learning process, we also introduce a technique called block learning. Our experiments with classification and clustering tasks show that our framework improves upon several baselines according to standard evaluation metrics, including accuracy and normalized mutual information (NMI).
