Matrix multiplication on digital signal processors and hierarchical memory systems

We discuss mapping the matrix multiplication algorithm onto a two-level hierarchical memory system that supports DMA transfers between levels, as found on Digital Signal Processors (DSPs). We show that the hierarchical nature of the memory system can be hidden from the processor, so that computation proceeds at the processor’s speed. This is accomplished by using a block algorithm and by prefetching data, under DMA control, from the slower second-level memory into the faster but smaller first-level memory. The Texas Instruments TMS320C30 Digital Signal Processor is used as an example, and performance estimates for different memory timings are given. These results are compared with the performance of the matrix multiplication algorithm executed without exploiting the DMA capabilities.
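
The combination of blocking and prefetching described above can be illustrated with a minimal C sketch. It is not the implementation evaluated in this work: the names N, BS, fetch_block, block_mac and blocked_matmul are hypothetical, the matrix and block sizes are arbitrary, and the DMA transfer is simulated by a synchronous memcpy. On real hardware the fetch of the next pair of blocks would be started on the DMA controller and would run concurrently with the multiply-accumulate on the current pair.

```c
#include <assert.h>
#include <string.h>

#define N  8   /* matrix dimension (assumed for illustration) */
#define BS 4   /* block size (assumed); N must be a multiple of BS */

/* Stand-in for a DMA transfer: copy block (row, col) of a row-major N x N
   matrix in "slow" memory into a contiguous BS x BS buffer in "fast" memory.
   A real DMA transfer would be asynchronous; this copy is synchronous. */
static void fetch_block(const float *src, int row, int col, float *dst)
{
    for (int i = 0; i < BS; i++)
        memcpy(&dst[i * BS], &src[(row * BS + i) * N + col * BS],
               BS * sizeof(float));
}

/* c += a * b for BS x BS blocks that all reside in fast memory. */
static void block_mac(const float *a, const float *b, float *c)
{
    for (int i = 0; i < BS; i++)
        for (int k = 0; k < BS; k++)
            for (int j = 0; j < BS; j++)
                c[i * BS + j] += a[i * BS + k] * b[k * BS + j];
}

/* C = A * B using a block algorithm with two alternating input buffers. */
void blocked_matmul(const float *A, const float *B, float *C)
{
    float abuf[2][BS * BS], bbuf[2][BS * BS], cbuf[BS * BS];
    const int nb = N / BS;   /* number of blocks per dimension */

    assert(N % BS == 0);

    for (int bi = 0; bi < nb; bi++) {
        for (int bj = 0; bj < nb; bj++) {
            int cur = 0;
            memset(cbuf, 0, sizeof cbuf);

            /* Fetch the first pair of blocks before computing. */
            fetch_block(A, bi, 0, abuf[cur]);
            fetch_block(B, 0, bj, bbuf[cur]);

            for (int bk = 0; bk < nb; bk++) {
                int nxt = cur ^ 1;

                /* Prefetch the next pair into the idle buffers. With real
                   DMA this would overlap with block_mac below. */
                if (bk + 1 < nb) {
                    fetch_block(A, bi, bk + 1, abuf[nxt]);
                    fetch_block(B, bk + 1, bj, bbuf[nxt]);
                }

                block_mac(abuf[cur], bbuf[cur], cbuf);
                cur = nxt;
            }

            /* Write the finished result block back to slow memory. */
            for (int i = 0; i < BS; i++)
                memcpy(&C[(bi * BS + i) * N + bj * BS], &cbuf[i * BS],
                       BS * sizeof(float));
        }
    }
}
```

In this sketch the processor only ever reads operands from the small buffers, which is what allows the slower second-level memory to stay off the critical path, provided each transfer finishes before the corresponding block computation does. Whether that condition holds depends on the block size and the memory timings.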