TweetLID: a benchmark for tweet language identification

Language identification, the task of determining the language in which a given text is written, has progressed substantially in recent decades. However, three main issues remain unresolved: (1) distinguishing similar languages, (2) detecting multilingualism in a single document, and (3) identifying the language of short texts. In this paper, we describe our work on the development of a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the task, and make a dataset of annotated tweets publicly available for research purposes. We also describe the shared task we organized to validate and assess the evaluation framework and dataset with systems submitted by seven different participants, and analyze the performance of these systems. The evaluation of the results submitted by the participants of the shared task sheds light on the shortcomings of state-of-the-art language identification systems, and gives insight into the extent to which the brevity, multilingualism, and language similarity found in texts degrade the performance of language identifiers. Our dataset of nearly 35,000 tweets and the evaluation framework provide researchers and practitioners with suitable resources to further study the aforementioned issues in language identification within a common setting that enables results to be compared with one another.
