Subword and Spatiotemporal Models for Identifying Actionable Information in Haitian Kreyol

Crisis-affected populations can often maintain digital communications, but in a sudden-onset crisis aid organizations will have the fewest free resources to process those communications. Information that aid agencies can actually act on, 'actionable' information, will be sparse, so there is great potential to (semi)automatically identify actionable communications. There are hurdles, however: the languages spoken will often be under-resourced and show orthographic variation, and the precise definition of 'actionable' will be response-specific and evolving. We present a novel system that addresses these hurdles, drawing on 40,000 emergency text messages sent in Haiti following the January 12, 2010 earthquake, predominantly in Haitian Kreyol. We show that keyword/ngram-based models using streaming MaxEnt achieve an F-score of up to F=0.21. Further, we find that current state-of-the-art subword models increase this substantially to F=0.33, while modeling the spatial, temporal, topic and source contexts of the messages can increase this to F=0.86 over direct text messages and F=0.90-0.97 over social media, making it a viable strategy for message prioritization.
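
To illustrate why subword features can help where whole-word features fail under orthographic variation, the following is a minimal sketch in Python. It uses character trigrams as a stand-in subword representation; this is an assumption made for illustration and is not the subword model evaluated in the paper. The helper names `char_ngrams` and `jaccard` are hypothetical.

```python
def char_ngrams(token, n=3):
    """Return the set of character n-grams of a token.

    The token is lowercased and padded with boundary markers so that
    word-initial and word-final n-grams are distinguishable.
    Illustrative stand-in for a subword model, not the paper's method.
    """
    padded = f"^{token.lower()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap between two feature sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two spellings of the same word: as whole-word features they do not match.
spelling_a = "Kreyol"
spelling_b = "Kreyòl"
whole_word_match = spelling_a.lower() == spelling_b.lower()

# As character trigram sets they share part of their features.
overlap = jaccard(char_ngrams(spelling_a), char_ngrams(spelling_b))

print("whole-word match:", whole_word_match)
print("trigram overlap:", overlap)
```

A keyword or word-ngram classifier treats the two spellings as unrelated features, whereas a classifier over subword features can share evidence between them.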
