Sentiment Categorization on a Creole Language with Lexicon-Based and Machine Learning Techniques

We propose polarity detection for colloquial expressions characteristic of a bilingual population. The hybrid language we address is called "Jopara", a mixture of Spanish and Guaraní spoken in Paraguay and comparable to Louisiana Creole in the United States. We categorize polarity into three classes (positive, negative and neutral) and address the problem with both lexicon-based and machine-learning approaches. This paper describes the application scenario, the construction of the bilingual lexicon, and the attribute preprocessing used to create the classifiers' input. The input data is retrieved from Twitter, so the expressions resemble natural language use. Finally, we present results comparing the performance of these techniques when applied to this kind of language. The results show that classical classifiers perform very well, with correct classification rates above 80% even with small training sets, provided that their parameters are properly tuned and the attributes are adequately selected.