A Highly Efficient Consolidated Platform for Stream Computing and Hadoop

Data Stream Processing, or stream computing, is a computing paradigm for processing massive amounts of streaming data in real time without first storing the data in secondary storage. In this paper we propose an integrated execution platform for Data Stream Processing and Hadoop with a dynamic load balancing mechanism, with the goals of using computer systems efficiently and reducing the latency of Data Stream Processing. Our implementation is built on top of System S, a distributed data stream processing system developed by IBM Research. Our experimental results show that the load balancing mechanism increased CPU usage from 47.77% to 72.14% compared with a configuration without load balancing. Moreover, the results show that the latency of stream processing jobs is kept low even under bursty input, because more compute resources are dynamically allocated to the stream processing jobs.
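The idea of shifting compute resources between stream processing and batch jobs can be illustrated with a minimal sketch. It is not the mechanism implemented on System S. The `Cluster` type, the `rebalance` function, the latency thresholds, and the one-node-per-step policy are all hypothetical and chosen only for illustration: a node is moved from the batch (Hadoop) pool to the stream pool when observed stream latency exceeds an upper threshold, and returned when latency falls below a lower threshold.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical shared cluster split between stream and batch jobs."""
    total_nodes: int
    stream_nodes: int  # nodes currently assigned to stream processing

    @property
    def batch_nodes(self) -> int:
        # Every node not used for stream processing runs batch (Hadoop) work.
        return self.total_nodes - self.stream_nodes


def rebalance(cluster: Cluster, latency_ms: float,
              high_ms: float = 100.0, low_ms: float = 20.0,
              min_stream: int = 1, min_batch: int = 0) -> int:
    """Move at most one node per call, based on observed stream latency.

    The thresholds are assumed values, not taken from the paper.
    """
    if latency_ms > high_ms and cluster.batch_nodes > min_batch:
        # Burst: take a node away from batch work to cut stream latency.
        cluster.stream_nodes += 1
    elif latency_ms < low_ms and cluster.stream_nodes > min_stream:
        # Quiet period: hand a node back to batch work to keep CPUs busy.
        cluster.stream_nodes -= 1
    return cluster.stream_nodes


# Simulated latency samples (ms): quiet, then a burst, then quiet again.
samples = [10, 50, 150, 200, 180, 60, 15, 12]
cluster = Cluster(total_nodes=8, stream_nodes=1)
history = [rebalance(cluster, s) for s in samples]
print(history)
```

Using two thresholds rather than one gives the policy some hysteresis, so a latency value hovering near a single cutoff does not cause nodes to bounce between the two pools on every sample.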