Appraising SPARK on Large-Scale Social Media Analysis

Software systems for social media analysis provide algorithms and tools for extracting useful knowledge from user-generated social media data. ParSoDA (Parallel Social Data Analytics) is a Java library for developing parallel data analysis applications that extract such knowledge from social media data. The library aims to reduce the programming skills needed to implement scalable social data analysis applications. This work describes how ParSoDA has been extended to execute applications on Apache Spark. On a cluster of 12 workers, the Spark version of the library reduces the execution time of two case study applications that exploit social media data by up to 42% compared to the Hadoop version.
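
To put the reported figure in perspective, a reduction in execution time can be restated as a speedup factor. The short sketch below does only that arithmetic; the function name is hypothetical and is not part of ParSoDA, and the only input taken from the text is the 42% best-case reduction.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional execution-time reduction into a speedup factor.

    If the new time is (1 - reduction) times the old time, the speedup is
    old_time / new_time = 1 / (1 - reduction).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


# Best-case reduction reported for the Spark version relative to Hadoop.
best_case = speedup_from_reduction(0.42)
print(f"{best_case:.2f}x")  # prints "1.72x"
```

Because 42% is stated as an upper bound across the two case studies, a factor of about 1.72 is the best case, not a typical one.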
