Aten: A Dispatcher for Big Data Applications in Heterogeneous Systems

Stream Processing Engines (SPEs) must sustain high data ingestion rates to deliver the quality and efficiency that end users and system administrators expect. The data flow processed by an SPE fluctuates over time and requires real-time or near-real-time adjustment of the resource pool (network, memory, CPU, and others). This leads to the problem known as skewed data production: a non-uniform incoming flow at specific points in the environment slows applications down through network bottlenecks and inefficient load balancing. This work proposes Aten as a solution to unbalanced data flows processed by Big Data stream applications in heterogeneous systems. Aten manages data aggregation and data streams within message queues, applying different algorithms as strategies to partition the data flow over all available computational resources. The paper presents preliminary results indicating that it is possible to maximize throughput while also providing low latency for SPEs.
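To make the idea of interchangeable partitioning strategies concrete, the following is a minimal sketch that contrasts two generic strategies on a skewed stream. It is not Aten's implementation: the function names, the worker names, and the capacity weights are all hypothetical, chosen only to show why plain key hashing can overload one node while a capacity-aware strategy spreads the load across heterogeneous workers.

```python
import zlib


def hash_partition(key: str, workers: list[str]) -> str:
    """Key grouping: a given key is always routed to the same worker."""
    return workers[zlib.crc32(key.encode()) % len(workers)]


def weighted_least_loaded(load: dict[str, int], capacity: dict[str, float]) -> str:
    """Pick the worker whose load is lowest relative to its capacity."""
    return min(capacity, key=lambda w: load[w] / capacity[w])


def dispatch_by_hash(keys: list[str], workers: list[str]) -> dict[str, int]:
    """Count how many messages each worker receives under key hashing."""
    load = {w: 0 for w in workers}
    for key in keys:
        load[hash_partition(key, workers)] += 1
    return load


def dispatch_by_capacity(keys: list[str], capacity: dict[str, float]) -> dict[str, int]:
    """Count messages per worker when routing ignores keys and follows capacity."""
    load = {w: 0 for w in capacity}
    for _ in keys:
        load[weighted_least_loaded(load, capacity)] += 1
    return load


if __name__ == "__main__":
    # Hypothetical heterogeneous cluster: "fast" has three times the capacity of "slow".
    capacity = {"fast": 3.0, "slow": 1.0}
    # Skewed stream: one hot key dominates the traffic.
    keys = ["hot"] * 360 + [f"cold-{i}" for i in range(40)]

    print("hash:    ", dispatch_by_hash(keys, list(capacity)))
    print("capacity:", dispatch_by_capacity(keys, capacity))
```

The capacity-aware strategy gives up key affinity, so it only suits stateless operators or operators whose partial results can be merged downstream; that trade-off is one reason a dispatcher may want several strategies available.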
