CatDetect, a framework for detecting Catalan tweets

This work addresses language detection, a field whose proposals range from lexical and morphological analysis to an increasing use of machine learning. The study focuses on Catalan, a minority language, which makes detecting Catalan tweets (messages posted on the Twitter social network) more difficult. To this end, a Catalan Twitter corpus was generated using lexical and morphological approaches and then used to train supervised machine learning models. The models were evaluated to determine which one obtains the best prediction score and is therefore the most suitable for use. We show that our proposal is successful on Twitter in the case of minority languages. The best model is to be deployed on a website, where users can test the algorithm interactively through a front-end web page and, in the background, through a web service exposed via a RESTful API.
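The lexical step used to build the corpus could be sketched roughly as follows. This is only an illustration, not the authors' implementation: the word list `CATALAN_WORDS`, the helper names, and the `threshold` value are all assumptions introduced here, and a real lexicon would be much larger and would need care with words shared with Spanish or other Romance languages.

```python
import re

# Hypothetical mini-lexicon of frequent Catalan words (illustrative only;
# not taken from the paper).
CATALAN_WORDS = {
    "el", "la", "els", "les", "i", "amb", "però", "això", "que", "és",
    "per", "una", "un", "als", "dels", "molt", "avui", "també",
    "aquest", "aquesta",
}

# Runs of Unicode letters (no digits, no underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+")
# URLs, @mentions and #hashtags carry little language signal.
NOISE_RE = re.compile(r"https?://\S+|[@#]\w+")


def tokenize(tweet):
    """Lowercase the tweet, drop URLs/mentions/hashtags, return word tokens."""
    cleaned = NOISE_RE.sub(" ", tweet.lower())
    return TOKEN_RE.findall(cleaned)


def lexicon_score(tweet):
    """Fraction of tokens found in the (assumed) Catalan lexicon."""
    tokens = tokenize(tweet)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in CATALAN_WORDS)
    return hits / len(tokens)


def label_tweet(tweet, threshold=0.3):
    """Label a tweet as Catalan when its score reaches the assumed threshold."""
    return lexicon_score(tweet) >= threshold
```

Tweets labelled this way could then serve as training data for the supervised models; the threshold trades corpus size against label noise and would have to be tuned on real data.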
