Input data organization for batch processing in time window based computations

Event processing applications are often designed to continuously evaluate sets of events defined by sliding time windows. Solutions that execute long-running continuous queries in memory show their limits in applications characterized by a staggering growth in the number of sources that continuously produce new events at high rates (e.g., intrusion detection systems and algorithmic trading). Problems arise from the complexity of keeping large numbers of events in memory for continuous processing, and from the difficulty of managing the network of processing nodes at run time. A batch approach to this kind of computation is a viable solution for scenarios characterized by infrequent computations over very large time windows. In this paper we propose a model for batch processing in time-window event computations that allows the definition of multiple metrics for performance optimization. These metrics specifically take into account the organization of input data, so as to minimize its impact on computation latency. The model is then instantiated on Hadoop, a batch processing engine based on the MapReduce paradigm, and a set of strategies for efficiently arranging input data is described and evaluated.
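
To illustrate why the organization of input data can matter for a batch computation over a time window, the following minimal sketch groups events into fixed-width time buckets (standing in for one input file per bucket) and reads only the buckets that overlap the requested window. It is not the model or any of the strategies proposed in the paper; the bucket width, the function names, and the (timestamp, payload) event layout are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical bucket width: one input "file" per hour of events.
BUCKET_SECONDS = 3600


def bucket_of(timestamp):
    """Return the index of the fixed-width bucket containing the timestamp."""
    return timestamp // BUCKET_SECONDS


def organize(events):
    """Group (timestamp, payload) events by bucket index."""
    buckets = defaultdict(list)
    for ts, payload in events:
        buckets[bucket_of(ts)].append((ts, payload))
    return buckets


def buckets_for_window(start, end):
    """Return the bucket indices that overlap the half-open window [start, end)."""
    if end <= start:
        return range(0)
    return range(bucket_of(start), bucket_of(end - 1) + 1)


def count_in_window(buckets, start, end):
    """Count events in [start, end), scanning only the overlapping buckets."""
    total = 0
    for b in buckets_for_window(start, end):
        # Boundary buckets may hold events outside the window, so filter them.
        for ts, _ in buckets.get(b, []):
            if start <= ts < end:
                total += 1
    return total


if __name__ == "__main__":
    sample = [(10, "a"), (3599, "b"), (3600, "c"), (7300, "d"), (20000, "e")]
    organized = organize(sample)
    print(count_in_window(organized, 3000, 7500))
```

With this layout, a computation over a window touches only the buckets returned by buckets_for_window, so the amount of input read grows with the window size rather than with the total number of stored events. The choice of bucket width trades the number of input units against the amount of out-of-window data scanned in the two boundary buckets.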
