Docker Environment-Based Apache Storm and Spark Benchmark Test

With the spread of high-speed Internet and social networking services (SNS), many fields now need to process big data generated in real time. Real-time streaming data processing technology has developed in response, and representative platforms include Apache Storm, Apache Spark, and Hadoop. Because the performance of these platforms, such as throughput and processing speed, varies with the server environment, they provide scalability by supporting distributed configurations across multiple servers; however, the more servers there are, the harder the system is to manage. Docker, a kind of virtualization system that makes expansion easy, can address this management problem. Nevertheless, some sites keep a native environment rather than adopt Docker, because performance may be reduced, a disadvantage common to all virtualization systems. In this paper, we build Apache Storm and Apache Spark, both real-time data processing systems, in Docker and native environments and measure their performance in experiments that process JSON-format data, in order to verify how much performance decreases in the Docker environment.
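The kind of measurement described above can be illustrated with a minimal sketch. It is not the paper's benchmark: it does not use Storm or Spark, and the record schema, function names, and record count are all hypothetical. It only shows how a records-per-second figure for JSON processing could be taken with the same script in a native and a Docker environment, and how the relative decrease between the two might be computed.

```python
import json
import time


def make_records(n):
    """Build n synthetic JSON lines (hypothetical schema, not from the paper)."""
    return [
        json.dumps({"id": i, "sensor": f"s{i % 10}", "value": i * 0.5})
        for i in range(n)
    ]


def measure_throughput(lines):
    """Parse every line and return (records_processed, records_per_second)."""
    start = time.perf_counter()
    processed = 0
    for line in lines:
        record = json.loads(line)
        if "value" in record:
            processed += 1
    elapsed = time.perf_counter() - start
    rate = processed / elapsed if elapsed > 0 else float("inf")
    return processed, rate


def relative_slowdown(native_rps, docker_rps):
    """Fraction by which the Docker throughput falls below the native one."""
    return (native_rps - docker_rps) / native_rps


if __name__ == "__main__":
    count, rps = measure_throughput(make_records(100_000))
    print(f"processed {count} records at {rps:.0f} records/s")
```

Running the script once on the host and once inside a container, then passing the two rates to `relative_slowdown`, would give a rough single-process comparison; a real Storm or Spark benchmark would also have to account for cluster size, network transport, and warm-up.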
