Comparison of Word Embeddings and Sentence Encodings as Generalized Representations for Crisis Tweet Classification Tasks

Many machine learning and natural language processing approaches, including supervised and domain adaptation algorithms, have been proposed and studied in the context of filtering crisis tweets. However, applying these approaches in practice remains challenging because of the time-critical requirements of emergency response operations and the diversity and unique characteristics of emergency events. To address this limitation, we explore the idea of building “generalized” classifiers for filtering crisis tweets: classifiers that can be pre-trained, are ready to use in real time, and generalize well to tweets from future disasters. We propose to achieve this objective using a simple feature-based adaptation approach, where tweets are represented as dense numeric vectors of reduced dimensionality using either word embeddings or sentence encodings. Because several types of word embeddings and sentence encodings exist, we compare the tweet representations they produce, with the goal of understanding which embeddings or encodings are most suitable for crisis tweet classification tasks. Our experimental results on three crisis tweet classification tasks suggest that tweet representations based on GloVe embeddings produce better results than representations based on the other embeddings when used with traditional supervised learning algorithms. Furthermore, GloVe embeddings trained on crisis data produce better results on more specific crisis tweet classification tasks (e.g., informative versus non-informative tweets), while GloVe embeddings pre-trained on a large collection of general tweets produce better results on more general classification tasks (e.g., tweets relevant or not relevant to a crisis).
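As a rough illustration of representing a tweet as a dense vector, the sketch below averages the word vectors of a tweet's tokens. The abstract does not state how word vectors are combined, so mean pooling is an assumption here, and the embedding table, its values, and the function name `tweet_vector` are hypothetical stand-ins for real pre-trained vectors such as GloVe.

```python
from typing import Dict, List

# Hypothetical 3-dimensional embedding table for illustration only.
# Real pre-trained embeddings have far more dimensions and are loaded from a file.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "flood": [0.9, 0.1, 0.0],
    "rescue": [0.7, 0.3, 0.2],
    "help": [0.5, 0.5, 0.0],
    "pizza": [0.0, 0.2, 0.9],
}


def tweet_vector(tweet: str, embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Return the mean of the word vectors for the known tokens in a tweet.

    Assumption: whitespace tokenization and mean pooling. Tokens missing from
    the embedding table are skipped; a tweet with no known tokens maps to the
    zero vector.
    """
    tokens = tweet.lower().split()
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


if __name__ == "__main__":
    print(tweet_vector("Flood rescue needed", TOY_EMBEDDINGS, dim=3))
```

The resulting fixed-length vector could then be passed as the feature input to any traditional supervised learner. Because the features do not depend on the vocabulary of a particular disaster, a classifier trained on past events can be applied to tweets from a new one without retraining.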
