Hybrid System for Information Extraction from Social Media Text: Drug Abuse Case Study

Abstract Social media are increasingly used in healthcare as a communication channel between patients and caregivers, creating new sources of information rich in knowledge that may improve the field. Analyzing social media data has therefore become a real business requirement for the healthcare industry and for data scientists. However, because these data are complex and unstructured, existing natural language processing tools cannot exploit them successfully. The literature offers a wide range of approaches based on dictionaries, linguistic patterns, and machine learning, each with its own strengths and weaknesses. In this work, we propose a hybrid system that combines these approaches, taking advantage of each, to extract structured and salient drug abuse information from health-related tweets. We improve the system's accuracy by updating the domain dictionary in real time. We collected 1,000,000 tweets and conducted several experiments showing the advantage of hybridization for efficient information extraction from social media data.
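
To make the idea of hybridization concrete, the following minimal sketch shows one way the three families of approaches could be chained, with the dictionary updated as new terms are confirmed. It is an illustration under stated assumptions, not the system evaluated in this work: the seed terms, the pattern `ABUSE_PATTERN`, and the function names `classify_candidate` and `extract` are hypothetical, and the classifier is a trivial placeholder where a trained model would normally sit.

```python
import re

# Hypothetical seed dictionary of drug terms (illustrative, not from the paper).
drug_terms = {"oxycodone", "xanax"}

# Hypothetical linguistic pattern: an abuse-related verb, an optional
# determiner, then a candidate term captured in group 1.
ABUSE_PATTERN = re.compile(
    r"\b(?:snorted|popped|abused)\s+(?:some\s+|a\s+|the\s+)?(\w+)",
    re.IGNORECASE,
)


def classify_candidate(term: str) -> bool:
    """Placeholder for a trained classifier (assumption).

    Accepts any purely alphabetic term longer than three characters.
    """
    return term.isalpha() and len(term) > 3


def extract(tweet: str) -> list[str]:
    """Return the sorted drug terms found in a tweet."""
    # Step 1: dictionary lookup on lowercased tokens.
    tokens = re.findall(r"[a-z0-9]+", tweet.lower())
    found = {token for token in tokens if token in drug_terms}

    # Step 2: the pattern proposes candidates that the dictionary missed.
    for match in ABUSE_PATTERN.finditer(tweet):
        candidate = match.group(1).lower()
        # Step 3: the classifier filters candidates; accepted terms are
        # added to the dictionary so later tweets benefit from them.
        if candidate not in drug_terms and classify_candidate(candidate):
            drug_terms.add(candidate)
            found.add(candidate)
    return sorted(found)


print(extract("He popped some percs last night"))  # ['percs']
print(extract("percs and xanax again"))            # ['percs', 'xanax']
```

In the second call, "percs" is matched by the dictionary alone because the first call added it, which is the effect the real-time dictionary update is meant to achieve.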