Using Word Embeddings with Linear Models for Short Text Classification

Text documents often contain information relevant to a particular domain in short “snippets”. The social science field of peace and conflict studies is one such domain: identifying, classifying and tracking drivers of conflict from text sources is important, and snippets are typically classified by human analysts using an ontology. One issue in automating this process is that snippets tend to contain rare terms, which provide little class-conditional evidence. In this work we develop a method that enriches a bag-of-words model by complementing rare terms in the text to be classified with related terms from a word vector model. This method is then combined with standard linear text classification algorithms. By reducing sparseness in the bag-of-words representation, the enriched models perform better than the baseline classifiers. A second issue is improving performance on “small” classes that have only a few examples; here we show that Paragraph Vectors outperform the enriched models.
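The enrichment idea can be illustrated with a minimal sketch. It is not the implementation used in this work. It assumes that a term counts as rare when its training document frequency is at most a threshold, that related terms are the nearest neighbours by cosine similarity, and that each added term inherits the count of the rare term it complements. The names `enrich`, `rare_max_df` and `k`, and the toy vectors and frequencies, are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def enrich(tokens, doc_freq, vectors, rare_max_df=1, k=1):
    """Return a bag-of-words in which each rare term is complemented
    by its k most similar non-rare terms (assumed enrichment rule)."""
    counts = Counter(tokens)
    bag = Counter(counts)
    for term, count in counts.items():
        # Skip terms that are frequent enough or have no vector.
        if doc_freq.get(term, 0) > rare_max_df or term not in vectors:
            continue
        # Candidates: terms with vectors that are not themselves rare.
        candidates = [
            w for w in vectors
            if w != term and doc_freq.get(w, 0) > rare_max_df
        ]
        candidates.sort(
            key=lambda w: cosine(vectors[term], vectors[w]), reverse=True
        )
        for w in candidates[:k]:
            bag[w] += count
    return bag


# Toy data, purely illustrative (hypothetical vectors and frequencies).
vectors = {
    "drought": (0.9, 0.1),
    "famine": (0.8, 0.2),
    "militia": (0.1, 0.9),
    "rebels": (0.2, 0.8),
}
doc_freq = {"famine": 12, "rebels": 30, "drought": 1, "the": 100, "region": 20}

enriched = enrich(["drought", "hits", "the", "region"], doc_freq, vectors)
print(enriched)
```

The enriched counts can then be passed to any standard linear classifier in place of the original bag-of-words. In this toy example the rare term "drought" brings in "famine", which is frequent enough in training to carry class-conditional evidence.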