Selective off-loading to Memory: Task Partitioning and Mapping for PIM-enabled Heterogeneous Systems

Processing-in-Memory (PIM) is returning as a promising solution to address the issue of memory wall as computing systems gradually step into the big data era. Researchers continually proposed various PIM architecture combined with novel memory device or 3D integration technology, but it is still a lack of universal task scheduling method in terms of the new heterogeneous platform. In this paper, we propose a formalized model to quantify the performance and energy of the PIM+CPU heterogeneous parallel system. In addition, we are the first to build a task partitioning and mapping framework to exploit different PIM engines. In this framework, an application is divided into subtasks and mapped onto appropriate execution units based on the proposed PIM-oriented Earliest-Finish-Time (PEFT) algorithm to maximize the performance gains brought by PIM. Experimental evaluations show our PIM-aware framework significantly improves the system performance compared to conventional processor architectures.

[1]  Kiyoung Choi,et al.  A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[2]  Salim Hariri,et al.  Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing , 2002, IEEE Trans. Parallel Distributed Syst..

[3]  Pedro López,et al.  Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors , 2007, 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07).

[4]  Kiyoung Choi,et al.  PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[5]  Mike Ignatowski,et al.  TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[6]  Feifei Li,et al.  NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[7]  Julio Sahuquillo,et al.  Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors , 2007 .

[8]  Richard Johnson Efficient program analysis using dependence flow graphs , 1995 .

[9]  Jung Ho Ahn,et al.  CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[10]  Noah Treuhaft,et al.  Scalable Processors in the Billion-Transistor Era: IRAM , 1997, Computer.