RAMCI: a novel asynchronous memory copying mechanism based on I/OAT

Memory copying is one of the most common operations in modern software. It is usually performed as a synchronous (sync) CPU procedure, which incurs overheads such as cache pollution and CPU stalling, especially when bulk-copying large data. To mitigate this, several approaches based on I/OAT, a dedicated and widely used hardware copy engine on Intel platforms, have been proposed, but they still suffer from three problems: (1) no atomic allocation/revocation at the granularity of an I/OAT channel; (2) lack of interrupt support; and (3) complicated programming interfaces. We propose RAMCI, an asynchronous (async) memory copying mechanism built on the Intel I/OAT engine. RAMCI not only reduces the synchronous overheads but also overcomes the three issues above through (1) a lock mechanism based on the low-level CAS instruction; (2) a lightweight interrupt mechanism that signals the completion of a memory copy, replacing the polling pattern that consumes considerable CPU resources; and (3) a group of well-defined, abstract interfaces that allow programmers to use the free underlying I/OAT channels transparently. To support these interfaces, a novel I/OAT channel scheduler is introduced: it splits the source data into several pieces, each of which can be assigned a dedicated I/OAT channel so that the pieces are transferred in parallel. We evaluate RAMCI against other memory copying mechanisms in four NUMA scenarios. The experimental results show that RAMCI improves memory copying performance by up to 4.68× while retaining almost the full capacity for parallel computation.
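The atomic channel allocation described above can be sketched with a low-level CAS instruction. The following C fragment is a minimal illustration, not RAMCI's actual implementation: the per-channel ownership flags, channel count, and function names are hypothetical, and GCC's `__atomic` builtins stand in for whatever CAS primitive the real system uses.

```c
#include <stdbool.h>

#define NUM_CHANNELS 8  /* hypothetical number of I/OAT channels */

/* One ownership flag per channel: 0 = free, 1 = owned. */
static int channel_owned[NUM_CHANNELS];

/* Atomically claim the first free channel; returns its index, or -1
   if every channel is busy. The CAS flips 0 -> 1 only if no other
   thread claimed channel i first, giving atomic allocation at
   channel granularity without a global lock. */
static int acquire_channel(void)
{
    for (int i = 0; i < NUM_CHANNELS; i++) {
        int expected = 0;
        if (__atomic_compare_exchange_n(&channel_owned[i], &expected, 1,
                                        false, __ATOMIC_ACQUIRE,
                                        __ATOMIC_RELAXED))
            return i;
    }
    return -1;
}

/* Atomic revocation: mark the channel free again. */
static void release_channel(int i)
{
    __atomic_store_n(&channel_owned[i], 0, __ATOMIC_RELEASE);
}
```

Because each flag is claimed with a single CAS, allocation and revocation are race-free even when many threads request channels concurrently, which is the property the abstract's first contribution targets.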
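The scheduler's split-and-dispatch idea can likewise be sketched in C. This is only an illustration of the data partitioning, under assumed names: a plain `memcpy` stands in for submitting a DMA descriptor to a free I/OAT channel, and the even chunk-size policy is an assumption, not RAMCI's actual heuristic.

```c
#include <stddef.h>
#include <string.h>

/* Split one bulk copy of `len` bytes into `nchan` pieces, one per
   (hypothetical) I/OAT channel. In the real mechanism each piece would
   be handed to a hardware channel and completed asynchronously; here
   memcpy stands in for the per-channel transfer. */
static void copy_parallel(void *dst, const void *src, size_t len, int nchan)
{
    size_t chunk = len / nchan;
    for (int i = 0; i < nchan; i++) {
        size_t off = (size_t)i * chunk;
        /* The last piece also takes the remainder when len is not
           evenly divisible by nchan. */
        size_t n = (i == nchan - 1) ? len - off : chunk;
        memcpy((char *)dst + off, (const char *)src + off, n);
    }
}
```

Splitting the source region this way lets independent channels move their pieces concurrently, which is how the scheduler extracts parallelism from a single large copy.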
