X-Layer: Building Composable Pipelined Dataflows for Low-Rank Convolutions

Prior research on hardware accelerators has largely focused on spatial convolutions (CONV), yet state-of-the-art DNNs employ low-rank convolutions (LR-CONV). LR-CONVs such as depthwise and pointwise convolutions exhibit lower arithmetic intensity and lower data reuse than CONV, which results in low hardware utilization and high latency. However, they provide opportunities for inter-layer data reuse. We propose X-Layer, which systematically explores the design space of cross-layer dataflows. We develop novel fine-grain cross-layer dataflows for LR-CONVs that support partial loop dimension completion. X-Layer decouples the nested loops in a pipeline and combines them to create a common outer dataflow and several inner dataflows. X-Layer discovers additional opportunities for optimizing LR-CONVs: i) it overlaps adjacent layers at fine granularity with partially completed channels and filters, which minimizes the intermediate storage required; ii) it enables each pipelined layer to independently choose optimal outer and inner dataflows by supporting streaming activation transformations. We explore a large design space of cross-layer dataflows and evaluate them for depth-separable, inverted residual, and CONV layers across six different DNNs. We find that coarse-grain dataflows are sensitive to on-chip memory capacity (≥ 1.5 MB), and their performance drops steeply if sufficient on-chip SRAM is not provided. X-Layer dataflows achieve optimal performance across a wide range of on-chip memory sizes (≥ 32 KB). Compared to the existing coarse-grain and medium-grain dataflows, X-Layer improves performance by 7.8× and 16.6×, respectively, while requiring 8.3× and 2× less SRAM.
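
The claim that depthwise and pointwise convolutions have lower arithmetic intensity than a spatial CONV can be illustrated with a back-of-the-envelope count. The sketch below is not taken from the X-Layer evaluation. It assumes stride 1, "same" padding, and an idealized memory model in which every input, weight, and output element is moved exactly once, and it reports MACs per element moved rather than per byte. The function names and the 56×56×128 layer shape are hypothetical and chosen only for illustration.

```python
def conv_intensity(h, w, c_in, c_out, k):
    """MACs per element moved for a standard k x k convolution.

    Assumes stride 1, 'same' padding, and that each input, weight,
    and output element is transferred exactly once (ideal reuse).
    """
    macs = h * w * c_in * c_out * k * k
    elems = h * w * c_in + k * k * c_in * c_out + h * w * c_out
    return macs / elems


def depthwise_intensity(h, w, c, k):
    """MACs per element moved for a k x k depthwise convolution.

    Each channel is filtered independently, so there is no reduction
    across channels and no reuse of an input across filters.
    """
    macs = h * w * c * k * k
    elems = h * w * c + k * k * c + h * w * c
    return macs / elems


def pointwise_intensity(h, w, c_in, c_out):
    """MACs per element moved for a 1 x 1 (pointwise) convolution."""
    return conv_intensity(h, w, c_in, c_out, 1)


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    h, w, c, k = 56, 56, 128, 3
    print("CONV 3x3  :", round(conv_intensity(h, w, c, c, k), 2))
    print("depthwise :", round(depthwise_intensity(h, w, c, k), 2))
    print("pointwise :", round(pointwise_intensity(h, w, c, c), 2))
```

Under this model the depthwise intensity can never exceed k²/2 MACs per element, whatever the layer size, because each input element contributes to at most k² MACs and an equally sized output must also be written. That bound is one way to see why such layers tend to be memory-bound and why keeping intermediate activations on chip across adjacent layers is attractive.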