Heterogeneous Multicore Architecture

In order to satisfy the high-performance and low-power requirements for advanced embedded systems with greater flexibility, it is necessary to develop parallel processing on chips by taking advantage of the advances being made in semiconductor integration. Figure 2.1 illustrates the basic architecture of our heterogeneous multicore [1, 2]. Several low-power CPU cores and special purpose processor (SPP) cores, such as a digital signal processor, a media processor, and a dynamically reconfigurable processor, are embedded on a chip. In the figure, the number of CPU cores is m. There are two types of SPP cores, SPPa and SPPb, on the chip. The values n and k represent the respective number of SPPa and SPPb cores. Each processor core includes a processing unit (PU), a local memory (LM), and a data transfer unit (DTU) as the main elements. The PU executes various kinds of operations. For example, in a CPU core, the PU includes arithmetic units, register files, a program counter, control logic, etc., and executes machine instructions. With some SPP cores like the dynamic reconfigurable processor, the PU executes a large quantity of data in parallel using its array of arithmetic units. The LM is a small-size and low-latency memory and is mainly accessed by the PU in the same core during the PU’s execution. Some cores may have caches as well as an LM or may only have caches without an LM. The LM is necessary to meet the real-time requirements of embedded systems. The access time to a cache is non-deterministic because of cache misses. On the other hand, the access to an LM is deterministic. By putting a program and data in the LM, we can accurately estimate the execution cycles of a program that has hard real-time requirements. A data transfer unit (DTU) is also embedded in the core to achieve parallel execution of internal operation in the core and data transfer operations between cores and memories. Each PU in a core processes the data on its LM or its cache, and the DTU simultaneously executes memory-to-memory data transfer between cores. The DTU is like a direct memory controller (DMAC) and executes a command that transfers data between several kinds of memories, then checks and waits for the end of the data transfer, etc. Some DTUs are capable of command chaining, where multiple commands are executed in order. The frequency and voltage controller (FVC) connected to each core controls the frequency, voltage, and power supply of each core independently and reduces the total power consumption of the chip. If the frequencies or power supplies of the core’s PU, DTU, and LM can be independently controlled, the FVC can vary their frequencies and power supplies individually. For example, the FVC can stop the frequency of the PU and run the frequencies of the DTU and LM when the core is executing only data transfers. The on-chip shared memory (CSM) is a medium-sized on-chip memory that is commonly used by cores. Each core is connected to the on-chip interconnect, which may be several types of buses or crossbar switches. The chip is also connected to the off-chip main memory, which has a large capacity but high latency.