Cut, Distil and Encode (CDE): Split Cloud-Edge Deep Inference

In cloud-edge environments, running all Deep Neural Network (DNN) models in the cloud causes significant network congestion and high latency, whereas executing them exclusively on the edge device limits the size and structure of the DNN, impacting accuracy. This paper introduces a novel approach for partitioning DNN inference between the edge and the cloud. It is the first work to consider simultaneous optimization of both the memory usage at the edge and the size of the data transferred over the wireless link. Experiments were performed on two different network architectures, MobileNetV1 and VGG16. The proposed approach makes it possible to execute part of the network on very constrained devices (e.g., microcontrollers) and under poor network conditions (e.g., LoRa), whilst retaining reasonable accuracy. Moreover, the results show that the optimal layer at which to split the network depends on the bandwidth and memory constraints, whereas prior work suggests that the best choice is always to split the network at higher layers. We demonstrate superior performance compared to existing techniques.
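The split-inference idea summarised above can be illustrated with a minimal sketch. The toy feedforward model, its layer widths, and the `split_inference` helper below are illustrative assumptions, not the paper's method (the paper evaluates MobileNetV1 and VGG16 and additionally applies distillation and encoding at the cut point); the sketch only shows how the choice of cut layer changes the amount of data crossing the wireless link.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer widths for a toy feedforward net (illustrative only).
widths = [64, 32, 16, 8, 4]
weights = [rng.standard_normal((m, n)) for m, n in zip(widths[:-1], widths[1:])]

def run_layers(x, layers):
    # Apply each linear layer followed by a ReLU non-linearity.
    for w in layers:
        x = np.maximum(x @ w, 0.0)
    return x

def split_inference(x, split):
    # The edge device executes layers [0, split); the activation it produces
    # is what would be sent over the wireless link; the cloud runs the rest.
    edge_out = run_layers(x, weights[:split])
    transfer_bytes = edge_out.nbytes
    cloud_out = run_layers(edge_out, weights[split:])
    return cloud_out, transfer_bytes

x = rng.standard_normal((1, widths[0]))
for split in range(len(weights) + 1):
    _, tx = split_inference(x, split)
    print(f"split after layer {split}: {tx} bytes over the link")
```

In this toy model the activations shrink with depth, so a deeper cut transfers fewer bytes but requires more edge memory and compute, which mirrors the bandwidth/memory trade-off the abstract describes.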