OSIRIS-SR: a scalable yet reliable distributed workflow execution engine

Workflows provide an easy-to-use programming model for constructing complex services that are (recursively) composed of simpler services. For high-performance workflow execution, distributing (scaling out) the constituent services of a workflow across a set of computational nodes is a key concept and a straightforward advantage of the workflow paradigm. However, scalable workflow execution cannot be achieved by distributing services alone; it also requires novel architectures for the workflow engine in charge of service orchestration. Workflow orchestration is commonly provided by centralized solutions, but these architectures introduce performance bottlenecks and single points of failure. Hence, the workflow engine has to be distributed as well, by efficiently replicating workflow metadata across several nodes in a network. A particular challenge is to provide workflow execution that is scalable and reliable at the same time. In this paper, we present OSIRIS-SR, a decentralized middleware for the distributed execution of workflows that has been designed specifically to provide a high degree of both scalability and reliability. Locally, OSIRIS-SR leverages the concurrent and redundant Actor model for workflow processing; globally, it runs a number of scalable system services for the management of workflow metadata, the most prominent being the Safety Ring. The Safety Ring service features a self-healing node overlay that actively supervises workflow instances and, at the same time, serves as a scalable and reliable metadata store. We discuss in detail the Safety Ring architecture and the mechanics behind scalable and reliable workflow management in OSIRIS-SR. The evaluation results show that support for reliable workflow execution does not significantly impact the system's scalability characteristics.
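To make the idea of a self-healing ring that doubles as a replicated metadata store more concrete, the following is a minimal sketch and not the OSIRIS-SR implementation. It assumes a Chord-style layout in which nodes and keys are hashed onto a ring and each metadata entry is copied to a fixed number of consecutive nodes; the class `RingSketch`, its methods, the SHA-1 placement, and the replica count are all hypothetical choices made for illustration.

```python
import hashlib
from bisect import bisect_left


def ring_position(name: str) -> int:
    """Map a node name or metadata key to a position on the ring (assumed: SHA-1)."""
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)


class RingSketch:
    """Toy ring overlay: every key is stored on `replicas` consecutive nodes."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.nodes = sorted(nodes, key=ring_position)
        self.store = {n: {} for n in self.nodes}

    def owners(self, key):
        """Return the first node clockwise from the key plus its successors."""
        if not self.nodes:
            raise RuntimeError("ring is empty")
        positions = [ring_position(n) for n in self.nodes]
        start = bisect_left(positions, ring_position(key)) % len(self.nodes)
        count = min(self.replicas, len(self.nodes))
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(count)]

    def put(self, key, value):
        """Write the value to every node currently responsible for the key."""
        for node in self.owners(key):
            self.store[node][key] = value

    def get(self, key):
        """Read the value from the first responsible node that holds it."""
        for node in self.owners(key):
            if key in self.store[node]:
                return self.store[node][key]
        raise KeyError(key)

    def fail(self, node):
        """Drop a crashed node, then restore the replica count of its keys."""
        lost = self.store.pop(node)
        self.nodes.remove(node)
        for key in lost:
            survivors = [n for n in self.store if key in self.store[n]]
            if survivors:
                # Healing step: copy from a surviving replica to the new owner set.
                self.put(key, self.store[survivors[0]][key])
```

A short usage example under the same assumptions: metadata for a hypothetical workflow instance `wf-42` remains readable after one of its replica holders is removed from the ring.

```python
ring = RingSketch(["n1", "n2", "n3", "n4", "n5"], replicas=3)
ring.put("wf-42", {"state": "running"})
ring.fail(ring.owners("wf-42")[0])
print(ring.get("wf-42"))
```

The sketch ignores failure detection, concurrent updates, and consistency between replicas, all of which a real supervision overlay has to address.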
