Memory-Centric Architecture of Neural Processing Unit for Edge Device

Deep learning algorithms require substantial computational power and memory to deliver high performance. Edge devices, however, run deep learning networks of different sizes depending on the application, and their limited hardware resources demand network-specific hardware optimization. To minimize the hardware redesign effort across networks, we propose a Neural Processing Unit (NPU) consisting of one SRAM and 16 Processing Elements (PEs) that supports various parallel configurations. In this paper, we describe the NPU hardware in detail and present several combinations of parallel hardware structures. We also demonstrate that the hardware can handle a variety of networks by describing its behavior under the data configuration written to SRAM.
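
To make the parallel-configuration idea concrete, the sketch below models in C how 16 PEs might be partitioned into groups whose count matches a layer's output-channel count. The names (pe_config, choose_config, NUM_PES) and the power-of-two grouping policy are illustrative assumptions for exposition, not the paper's actual control scheme.

```c
/* Hypothetical sketch: partitioning 16 PEs into parallel groups.
 * Assumed model: groups work on different output channels in
 * parallel, while PEs within a group share one output channel. */
#include <stdio.h>

#define NUM_PES 16

typedef struct {
    int groups;        /* PE groups assigned to distinct output channels */
    int pes_per_group; /* PEs cooperating on a single output channel */
} pe_config;

/* Pick the largest power-of-two group count that does not exceed
 * the layer's output-channel count, so no PE group sits idle. */
static pe_config choose_config(int out_channels) {
    pe_config best = { 1, NUM_PES };
    for (int g = 1; g <= NUM_PES; g *= 2) {   /* 1, 2, 4, 8, 16 groups */
        if (g <= out_channels)
            best = (pe_config){ g, NUM_PES / g };
    }
    return best;
}

int main(void) {
    int layers[] = { 3, 8, 64 };  /* example per-layer output-channel counts */
    for (int i = 0; i < 3; i++) {
        pe_config c = choose_config(layers[i]);
        printf("out_ch=%2d -> %2d groups x %2d PEs\n",
               layers[i], c.groups, c.pes_per_group);
    }
    return 0;
}
```

Under this assumed policy, a 3-channel layer maps to 2 groups of 8 PEs while a 64-channel layer maps to 16 groups of 1 PE, illustrating how one fixed pool of PEs can be reconfigured per network rather than redesigned.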