Boosting Big Data Streaming Applications in Clouds With BurstFlow

The rapid growth of stream applications in financial markets, health care, education, social media, and sensor networks marks a milestone for data processing and analytics in recent years and raises new challenges for handling Big Data in real time. Traditionally, Stream Processing applications are deployed in a single cloud infrastructure because it offers extensive and adaptive virtual computing resources. Data sources, however, send data from distant and diverse locations to the cloud infrastructure, which increases application latency. The cloud infrastructure may also be geographically distributed, in which case it must run a set of frameworks to handle communication. These frameworks often comprise a Message Queue System and a Stream Processing Framework. In Multi-Cloud settings, each service is deployed in a different cloud and the services communicate over high-latency network links. This makes real-time application requirements hard to meet: data streams have different and unpredictable latencies, forcing the communication systems of cloud providers to adjust continually to changes in the environment. Previous works explore static micro-batches and demonstrate their potential to overcome communication issues. This paper introduces BurstFlow, a tool for enhancing communication between data sources located at the edges of the Internet and Big Data Stream Processing applications located in cloud infrastructures. BurstFlow introduces a strategy that dynamically adjusts micro-batch sizes according to the time required for communication and computation. BurstFlow also presents an adaptive data partition policy that distributes incoming streams across available machines by considering their memory and CPU capacities. Experiments on a real-world multi-cloud deployment show that BurstFlow reduces execution time by up to 77% compared with state-of-the-art solutions and improves CPU efficiency by up to 49%.
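
The two ideas named above, sizing micro-batches from measured communication and computation times and partitioning streams by memory and CPU capacity, can be illustrated with a minimal sketch. This is not BurstFlow's actual algorithm, which the abstract does not specify. The function names (`next_batch_size`, `partition_shares`), the 25% step, the size bounds, and the equal weighting of memory and CPU are all assumptions chosen for illustration.

```python
# Illustrative sketch only: the heuristics below are assumptions,
# not the policies implemented by BurstFlow.

def next_batch_size(current, comm_time, comp_time,
                    min_size=1, max_size=10_000, step=0.25):
    """Return the micro-batch size to use for the next batch.

    Assumed heuristic: when sending a batch takes longer than processing
    it, grow the batch to amortize per-message network overhead; when
    processing takes longer, shrink it so results are not delayed.
    """
    if comm_time <= 0 or comp_time <= 0:
        return current  # no usable measurement yet, keep the current size
    if comm_time > comp_time:
        proposed = current * (1 + step)
    elif comp_time > comm_time:
        proposed = current * (1 - step)
    else:
        proposed = current
    # Clamp to the allowed range.
    return max(min_size, min(max_size, round(proposed)))


def partition_shares(machines):
    """Return the fraction of the incoming stream assigned to each machine.

    `machines` maps a machine name to (free_memory_gb, cpu_cores).
    Assumed policy: a machine's share is the average of its fraction of
    total memory and its fraction of total CPU, so the shares sum to 1.
    """
    total_mem = sum(mem for mem, _ in machines.values())
    total_cpu = sum(cpu for _, cpu in machines.values())
    if total_mem <= 0 or total_cpu <= 0:
        raise ValueError("machines must report positive memory and CPU")
    return {
        name: 0.5 * mem / total_mem + 0.5 * cpu / total_cpu
        for name, (mem, cpu) in machines.items()
    }


if __name__ == "__main__":
    size = 100
    # Hypothetical (communication, computation) timings in seconds.
    for comm, comp in [(2.0, 1.0), (2.0, 1.5), (0.5, 1.0)]:
        size = next_batch_size(size, comm, comp)
        print("next batch size:", size)
    print(partition_shares({"a": (16, 8), "b": (8, 4), "c": (8, 4)}))
```

A real deployment would likely smooth the timing measurements (for example with a moving average) before reacting to them, since a single slow batch over a high-latency link would otherwise make the batch size oscillate.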
