Subword Variation in Text Message Classification

For millions of people in less-resourced regions of the world, text messages (SMS) provide the only regular contact with their doctor. Classifying messages by medical labels supports rapid responses to emergencies, the early identification of epidemics, and everyday administration, but challenges include text brevity, rich morphology, phonological variation, and limited training data. We present a novel system that addresses these challenges, developed with a clinic in rural Malawi using texts in the Chichewa language. We show that modeling morphological and phonological variation leads to a substantial average gain of F=0.206 and an error reduction of up to 63.8% for specific labels, relative to a baseline system optimized over word sequences. By comparison, there is no significant gain when applying the same system to the English translations of the same texts and labels, emphasizing the need for subword modeling in many languages. Language-independent morphological models perform as accurately as language-specific models, indicating broad deployment potential.
