A Model of Preprocessing For Social Media Data Extraction

Tropical diseases spread quickly and must be detected. One source of data for detection is the social media platform Twitter. However, social media data are structurally diverse, primarily because users write with unstructured syntax and grammar. Twitter messages (tweets) must therefore be cleaned by a preprocessing method that involves part-of-speech (POS) rules. This paper proposes a preprocessing model for Twitter data that yields a clean dataset. The model consists of several steps. First, we use out-of-vocabulary (OOV) words to analyse tweets written in Indonesian. Second, in the stemming step, we use the Sastrawi library. We also compare the results of tokenization alone with those of tokenization combined with OOV word handling, stemming, n-grams, and Sastrawi stop word removal, in order to arrive at a sound preprocessing approach. The experimental results show how the preprocessing tasks perform given the characteristics of Indonesian-language tweets. We conclude that our model yields more valuable results, in terms of meaningful word occurrences, than running only the common preprocessing tasks.
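The steps named above (tokenization, OOV handling, stemming, stop word removal, n-grams) can be sketched as a small pipeline. This is a minimal illustration using only the Python standard library, not the paper's implementation: the step order, the slang dictionary `SLANG_MAP`, the stop word list `STOP_WORDS`, the removal of URLs, mentions, and hashtags, and all function names are assumptions made for the example. The stemmer is passed in as a callable and defaults to a no-op; a Sastrawi stemmer's `stem` method could be supplied in its place.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical sample resources; a real pipeline would load full lists.
SLANG_MAP: Dict[str, str] = {"yg": "yang", "gk": "tidak", "dgn": "dengan", "sy": "saya"}
STOP_WORDS: Set[str] = {"yang", "dan", "di", "dengan", "saya"}

NOISE_RE = re.compile(r"https?://\S+|[@#]\w+")  # URLs, mentions, hashtags
TOKEN_RE = re.compile(r"[a-z]+")                # runs of lowercase letters


def tokenize(tweet: str) -> List[str]:
    """Lowercase the tweet, drop URLs/mentions/hashtags, keep alphabetic tokens."""
    text = NOISE_RE.sub(" ", tweet.lower())
    return TOKEN_RE.findall(text)


def normalize_oov(tokens: List[str], slang_map: Dict[str, str] = SLANG_MAP) -> List[str]:
    """Replace known out-of-vocabulary forms (slang, abbreviations) with standard words."""
    return [slang_map.get(token, token) for token in tokens]


def remove_stop_words(tokens: List[str], stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in stop_words]


def ngrams(tokens: List[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token windows."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(tweet: str, stem: Callable[[str], str] = lambda word: word) -> List[str]:
    """Assumed order: tokenize, normalize OOV words, stem, remove stop words."""
    tokens = normalize_oov(tokenize(tweet))
    tokens = [stem(token) for token in tokens]
    return remove_stop_words(tokens)


if __name__ == "__main__":
    sample = "Sy demam yg tinggi dgn batuk http://t.co/abc @dinkes #dbd"
    clean = preprocess(sample)
    print(clean)             # ['demam', 'tinggi', 'batuk']
    print(ngrams(clean, 2))  # [('demam', 'tinggi'), ('tinggi', 'batuk')]
```

Normalizing OOV words before stop word removal matters in this sketch: abbreviations such as "yg" only match the stop word list once they have been expanded to "yang".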
