Normalization of Abbreviation and Acronym on Microtext in Bahasa Indonesia by Using Dictionary-Based and Longest Common Subsequence (LCS)

Abstract. Communication today increasingly relies on expressing ideas in short text. Such communication is delivered through various media, such as short message service (SMS), Facebook statuses, Twitter posts, chat messages, comments, and other forms of short text. These kinds of short text are known as microtext. A microtext usually consists of one sentence or less, is written informally, and contains abbreviations, acronyms, emoticons, hashtags, and other features. These features pose a particular challenge for text classification: they cannot be processed directly as in traditional text processing, because doing so may lead to inaccurate results. Microtext normalization is therefore required to transform these features into well-written text before text processing is applied. This research aims to normalize two of these features, abbreviations and acronyms. The normalization applies a dictionary-based approach and a longest common subsequence (LCS) approach to microtext in Bahasa Indonesia. The dictionary-based approach shows excellent performance compared with LCS; however, it is limited to predefined abbreviations and acronyms. LCS, in contrast, offers dynamic microtext normalization. Combining both approaches increases normalization performance slightly.
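To make the two approaches concrete, the following is a minimal Python sketch of a dictionary lookup with an LCS fallback. It is an illustration under stated assumptions, not the procedure used in this research: the abstract does not specify how LCS candidates are selected, so the matching rule here (the token must be a full subsequence of a candidate word that shares its first letter, with the shortest such word preferred) is hypothetical. The function names, the sample dictionary, and the sample vocabulary are likewise invented for illustration.

```python
from typing import Dict, Iterable


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalize_token(token: str, dictionary: Dict[str, str],
                    vocabulary: Iterable[str]) -> str:
    """Normalize one token: dictionary first, then an LCS-based fallback."""
    key = token.lower()
    if not key:
        return token
    # Step 1 (dictionary-based): exact lookup of predefined short forms.
    if key in dictionary:
        return dictionary[key]
    # Step 2 (LCS fallback, assumed rule): accept a vocabulary word only if
    # every character of the token appears in it in order, and the first
    # letters agree. Prefer the word with the highest coverage ratio.
    best_word, best_score = None, 0.0
    for word in vocabulary:
        if not word or word[0] != key[0]:
            continue
        if lcs_length(key, word) == len(key):
            score = len(key) / len(word)
            if score > best_score:
                best_word, best_score = word, score
    # Leave the token unchanged when no candidate qualifies.
    return best_word if best_word is not None else token


def normalize_text(text: str, dictionary: Dict[str, str],
                   vocabulary: Iterable[str]) -> str:
    """Normalize a whitespace-tokenized microtext."""
    words = list(vocabulary)
    return " ".join(normalize_token(t, dictionary, words) for t in text.split())


# Hypothetical sample data, not taken from the research.
SAMPLE_DICTIONARY = {"sy": "saya", "dgn": "dengan", "yg": "yang"}
SAMPLE_VOCABULARY = ["makan", "minum", "teman", "sekolah"]

if __name__ == "__main__":
    print(normalize_text("sy mkn dgn tmn", SAMPLE_DICTIONARY, SAMPLE_VOCABULARY))
    # prints: saya makan dengan teman
```

In this sketch, "sy" and "dgn" are resolved by the dictionary, while "mkn" and "tmn" are not predefined and are resolved by the LCS fallback. This mirrors the trade-off stated above: the dictionary is reliable but limited to its entries, whereas the LCS step can handle unseen short forms at the cost of possible ambiguity when several words match.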