FTvNF: Fault Tolerant Virtual Network Functions

One of the major concerns about Network Function Virtualization (NFV) is that virtual network functions (VNFs) are less stable than dedicated hardware appliances. Stateful VNFs make recovery complex, and a central difficulty is handling sources of non-determinism such as multi-threaded processing, time dependence, and randomness. In this paper we present FTvNF, a new approach to network function recovery that incurs very low overhead during failure-free operation. This is in contrast to previous proposals, which take snapshots of the VNF state at certain checkpoints or store the VNF state externally. Compared with state-of-the-art approaches, our approach significantly reduces the latency overhead incurred by the network elements, both in failure-free operation and when failures occur. In addition, our approach better suits the common case of NFV service chaining: our mechanisms are applied once per chain, and thus significantly improve performance over approaches that treat each VNF separately.
