Towards efficient server architecture for virtualized network function deployment: Implications and implementations

Recent years have seen a revolution in network infrastructure driven by ever-increasing demands for data volume. One promising proposal to emerge from this revolution is Network Functions Virtualization (NFV), which has been widely adopted by service and cloud providers. The essence of NFV is to run network functions as virtualized workloads on commodity Standard High Volume Servers (SHVS), the industry standard. However, our experience deploying NFV on modern NUMA-based SHVS paints a frustrating picture. Due to the complexity of the NFV data plane and its service function chaining, modern NFV deployments on SHVS exhibit a unique processing pattern, the heterogeneous software pipeline (HSP), in which NFV traffic flows must be processed sequentially by heterogeneous software components from the NIC to the end receiver. Since the end-to-end performance of a flow is jointly determined by the performance of each processing stage, the resource allocation/mapping scheme on NUMA-based SHVS must account for thread dependences when scheduling, trading off the impact of co-location contention against that of remote packet transmission. In this paper, we develop a thread scheduling mechanism that collaboratively places the threads of an HSP to minimize the end-to-end performance slowdown of NFV traffic flows. It employs a dynamic programming-based method to search for the optimal thread mapping with negligible overhead. To support this mechanism, we also develop a performance slowdown estimation model that accurately estimates the slowdown at each stage of the HSP. We implement our collaborative thread scheduling mechanism on a real system and evaluate it with real workloads. On average, our algorithm outperforms state-of-the-art NUMA-aware and contention-aware scheduling policies by at least 7% in CPU utilization and 23% in traffic throughput, with negligible computational overhead (less than one second).
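The abstract describes a dynamic-programming search over thread-to-NUMA-node mappings that balances co-location contention against remote packet transfers between consecutive HSP stages. The sketch below is a minimal illustration of that idea under simplifying assumptions, not the paper's implementation: `place_pipeline`, `slowdown`, and `xfer` are hypothetical names, the contention cost is reduced to a per-stage, per-node term, and the transfer penalty applies only between adjacent stages.

```python
# Minimal sketch (assumed, not the paper's code) of a DP search for a
# thread placement of a heterogeneous software pipeline (HSP) on NUMA nodes.
# slowdown(stage, node): estimated contention slowdown of a stage on a node.
# xfer(prev_node, node): estimated remote packet-transfer penalty between
# the nodes hosting two consecutive stages. Both are stand-ins for the
# paper's performance slowdown estimation model.
from typing import Callable, List, Tuple


def place_pipeline(
    num_stages: int,
    num_nodes: int,
    slowdown: Callable[[int, int], float],
    xfer: Callable[[int, int], float],
) -> Tuple[float, List[int]]:
    """Return (minimal cumulative slowdown, chosen node for each stage)."""
    INF = float("inf")
    # dp[s][n]: best cumulative slowdown with stages 0..s placed, stage s on node n.
    dp = [[INF] * num_nodes for _ in range(num_stages)]
    parent = [[-1] * num_nodes for _ in range(num_stages)]

    for n in range(num_nodes):
        dp[0][n] = slowdown(0, n)

    for s in range(1, num_stages):
        for n in range(num_nodes):
            for p in range(num_nodes):
                cost = dp[s - 1][p] + xfer(p, n) + slowdown(s, n)
                if cost < dp[s][n]:
                    dp[s][n] = cost
                    parent[s][n] = p

    # Backtrack the optimal placement from the best final-stage node.
    placement = [0] * num_stages
    placement[-1] = min(range(num_nodes), key=lambda n: dp[-1][n])
    for s in range(num_stages - 1, 0, -1):
        placement[s - 1] = parent[s][placement[s]]
    return dp[-1][placement[-1]], placement


if __name__ == "__main__":
    # Toy cost model: node 0 is heavily contended; crossing nodes costs 2.0.
    contention = lambda stage, node: 3.0 if node == 0 else 1.0
    transfer = lambda prev, cur: 0.0 if prev == cur else 2.0
    total, nodes = place_pipeline(3, 2, contention, transfer)
    print(total, nodes)  # -> 3.0 [1, 1, 1]
```

With S stages and N nodes this search costs O(S * N^2), which is consistent with the abstract's claim of negligible scheduling overhead on commodity servers; the actual mechanism in the paper additionally relies on its slowdown model to capture contention among co-located threads.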
