Streaming Social Media Data Analysis for Events Extraction and Warehousing using Hadoop and Storm: Drug Abuse Case Study

Abstract In the age of big data, enterprises' information systems are flooded with data generated from social media, which raises the need to integrate these data into their business intelligence process for better decision making. However, these new data, which are streaming, voluminous, unstructured, and varied, overwhelm existing data warehousing systems and integration tools, which motivated this research work. In this paper, we propose a large-scale system based on distributed storage and parallel processing to achieve social media data warehousing. Specifically, we combine Storm and Hadoop to extract structured events from social media data and integrate them into the data warehouse. We take advantage of the real-time analysis of streaming data offered by Storm and the batch processing of large volumes of data offered by Hadoop, which facilitates the analysis of streaming social media data. For conceptual representation, we propose a customized multidimensional model in which we add an intermediate table to connect the social media data warehouse with the enterprise data warehouse. We implement it using Oracle 12c and feed it with events extracted from 1,000,000 tweets using the Pentaho Data Integration tool.