Text Normalization Algorithm on Twitter in Complaint Category

Abstract Many people use microblogs to express complaints or criticism. However, posts are limited to about 160 characters and are written as unstructured sentences. This is the biggest obstacle to processing the information, because unstructured sentences are difficult for the preprocessing steps of text processing tools to handle. Normalization is therefore needed to make unstructured sentences more understandable to a machine. We propose a normalization method for the Indonesian language that adopts ideas from other researchers' normalization work and adjusts them to the characteristics of unstructured Indonesian sentences. The experiment uses Indonesian-language Twitter data in the complaint category. The process is divided into three stages: cleaning, out-of-vocabulary (OOV) detection, and word replacement. A list of basic words and a slang dictionary are used in OOV detection, while a context dictionary is built to resolve ambiguity. The algorithm reaches an accuracy of about 90% in the complaint category.
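The three stages named in the abstract can be illustrated with a minimal Python sketch. This is one possible reading of the pipeline, not the authors' implementation: the dictionaries `BASIC_WORDS`, `SLANG`, and `CONTEXT`, their toy entries, the function names, and the rule of using the other words in the tweet as context cues are all assumptions made for illustration.

```python
import re

# Hypothetical toy resources; the real dictionaries are far larger.
BASIC_WORDS = {"saya", "tidak", "bisa", "yang", "kamu", "jalan",
               "rusak", "sudah", "kilometer"}
SLANG = {"yg": "yang", "gk": "tidak", "sdh": "sudah", "sy": "saya"}
# Ambiguous token -> cue words that select a replacement, plus a default.
CONTEXT = {
    "km": {"cues": [({"jalan", "jarak"}, "kilometer")], "default": "kamu"},
}


def clean(text):
    """Stage 1: lowercase, drop URLs, mentions, hashtags and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


def is_oov(token):
    """Stage 2: a token is out-of-vocabulary if it is not a basic word."""
    return token not in BASIC_WORDS


def resolve_ambiguous(token, tokens):
    """Pick a replacement for an ambiguous token from the rest of the tweet."""
    others = set(tokens) - {token}
    entry = CONTEXT[token]
    for cues, replacement in entry["cues"]:
        if cues & others:
            return replacement
    return entry["default"]


def replace_words(tokens):
    """Stage 3: replace OOV tokens via the context or slang dictionary."""
    result = []
    for token in tokens:
        if not is_oov(token):
            result.append(token)
        elif token in CONTEXT:
            result.append(resolve_ambiguous(token, tokens))
        elif token in SLANG:
            result.append(SLANG[token])
        else:
            result.append(token)  # unknown OOV token is left unchanged
    return result


def normalize(text):
    return " ".join(replace_words(clean(text)))


print(normalize("@pln Jalan yg rusak sdh 5 km!! http://t.co/x"))
print(normalize("km gk bisa"))
```

In this sketch the ambiguous token "km" becomes "kilometer" when a cue word such as "jalan" appears in the same tweet and "kamu" otherwise. OOV tokens found in neither dictionary are passed through unchanged.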