Watershed‐ng: an extensible distributed stream processing framework

Most high‐performance data processing (a.k.a. big data) systems allow users to express their computation using abstractions (like MapReduce) that simplify the extraction of parallelism from applications. Most frameworks, however, do not let users specify how communication must take place: that element is deeply embedded into the run‐time system abstractions, making changes hard to implement. In this work, we describe Watershed‐ng, our re‐engineering of the Watershed system, a framework based on the filter–stream paradigm and originally focused on continuous stream processing. Like other big‐data environments, Watershed provided object‐oriented abstractions to express computation (filters), but the implementation of streams was a run‐time system element. By isolating stream functionality into appropriate classes, Watershed‐ng makes it possible to combine communication patterns and to reuse common message‐handling functions (such as compression and blocking). The new architecture also supports the design of new communication patterns, letting users choose, for example, MPI, TCP, or shared‐memory implementations of communication channels as their problem demands. Applications designed for the new interface showed code‐size reductions of 50% or more in some cases. Performance also improved significantly, because several implementation bottlenecks were removed during the re‐engineering. Copyright © 2016 John Wiley & Sons, Ltd.
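The abstract's central design point — streams as first‐class classes whose transport and message‐handling behavior can be composed, instead of being fixed inside the run‐time — can be sketched roughly as follows. This is a minimal illustration only; every class and method name here is hypothetical and does not reflect the actual Watershed‐ng API:

```python
import zlib
from abc import ABC, abstractmethod

# Hypothetical channel abstraction: each transport (TCP, MPI, shared
# memory, ...) becomes a class, so applications pick one per stream.
class Channel(ABC):
    @abstractmethod
    def send(self, data: bytes) -> None: ...
    @abstractmethod
    def recv(self) -> bytes: ...

class InMemoryChannel(Channel):
    """Stand-in for a shared-memory transport."""
    def __init__(self):
        self._queue = []
    def send(self, data: bytes) -> None:
        self._queue.append(data)
    def recv(self) -> bytes:
        return self._queue.pop(0)

# Reusable message-handling stages (compression, blocking, ...) wrap
# any channel, so the same stage composes with every transport.
class CompressedChannel(Channel):
    def __init__(self, inner: Channel):
        self._inner = inner
    def send(self, data: bytes) -> None:
        self._inner.send(zlib.compress(data))
    def recv(self) -> bytes:
        return zlib.decompress(self._inner.recv())

class BlockingChannel(Channel):
    """Batches small messages into one block before forwarding them."""
    def __init__(self, inner: Channel, block_size: int = 2):
        self._inner, self._block, self._size = inner, [], block_size
    def send(self, data: bytes) -> None:
        self._block.append(data)
        if len(self._block) >= self._size:
            self._inner.send(b"\n".join(self._block))
            self._block = []
    def recv(self) -> bytes:
        return self._inner.recv()

# Composition: blocking over compression over a shared-memory channel.
stream = BlockingChannel(CompressedChannel(InMemoryChannel()))
stream.send(b"hello")
stream.send(b"world")
print(stream.recv())  # b'hello\nworld'
```

The point of the sketch is that swapping `InMemoryChannel` for a TCP‐ or MPI‐backed class would require no change to the compression or blocking stages, which is the kind of reuse and recombination the re‐engineered architecture is claimed to enable.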
