An Experience Report on Building a Big Data Analytics Framework Using Cloudera CDH and RapidMiner Radoop with a Cluster of Commodity Computers

Many real-world data sets are not only large in volume but also heterogeneous and rapidly generated. Such data, known as big data, typically cannot be analyzed with traditional software tools and techniques. Although the open-source Apache Hadoop project has been developed and used successfully for handling big data, the complexity of its setup and configuration, together with the need to learn additional related tools, has hindered non-technical researchers and educators from entering the area of big data analytics. To support the big data community, this paper describes the procedures for, and the experience gained from, building a big data analytics framework, and demonstrates its use on a popular case study, Twitter sentiment analysis. The framework comprises a cluster of four commodity computers running Cloudera CDH 6.0.1 and RapidMiner Studio 9.3 with the Text Processing, Hive Connector, and Radoop extensions. According to the study results, setting up a big data analytics framework on a cluster of computers does not require advanced computer knowledge, but it does require meticulous system configuration to satisfy the installation and software integration requirements. Once setup and configuration are done correctly, data analysis can be readily performed using the visual workflow designers provided by RapidMiner. Finally, the framework is further evaluated on a large data set of 185 million records, the “TalkingData AdTracking Fraud Detection” data set. The outcome is highly satisfactory and shows that the framework is easy to use and can practically be deployed for big data analytics.
