D-Sieve: A Novel Data Processing Engine for Efficient Handling of Crises-Related Social Messages

Existing literature demonstrates the usefulness of system-mediated algorithms, such as supervised machine learning, for detecting classes of messages in the social-data stream (e.g., topically relevant vs. irrelevant). The classification accuracy of these algorithms depends largely on the number of labeled samples provided during the learning phase. Other factors, such as the class distribution and the term distribution in the training set, also play an important role in a classifier's accuracy. However, for several reasons (money and time constraints, a limited number of skilled labelers, etc.), a large sample of labeled messages is often not immediately available for learning an efficient classification model. Consequently, a classifier built from such limited training data often misclassifies messages, and hence the applicability of such learning techniques (especially in the online setting) during ongoing crisis response remains limited. In this paper, we propose a post-classification processing step that leverages two additional content features, stable hashtag association and stable named entity association, to improve the classification accuracy of a classifier in realistic settings. We have tested our algorithms on two crisis datasets from Twitter (Hurricane Sandy 2012 and Queensland Floods 2013) and compared our results against those produced by a "best-in-class" baseline online classifier. By consistently producing better-quality results than the baseline algorithm, i.e., by correctly reclassifying data points that were misclassified in the prior step (false negatives to true positives and false positives to true negatives), we demonstrate the applicability of our approach in practice.
