A Weight-Sharing Autoencoder with Dynamic Quantization for Efficient Feature Compression

Collaborative inference (CI) improves the inference efficiency of deep neural networks (DNNs) by partitioning the computational workload between an edge device and a cloud platform. Efficient CI requires searching for the optimal partition layer that minimizes the end-to-end inference latency. In addition, the intermediate features at the partition layer should be compressed effectively. However, recent DNN-based feature compression methods require an independent model dedicated to each partition point, resulting in significant storage overhead. In this paper, we propose a novel method that efficiently compresses the features from variable partition layers using a single autoencoder. The proposed method incorporates a weight-sharing technique in which the autoencoders that compress the individual partition layers share their weights. In addition, dynamic-bitwidth quantization is supported to provide flexibility in the compression ratio. The experimental results show that the proposed method reduces the required parameter size by 4× compared with the existing method based on independent models, while keeping the accuracy loss within 0.5%.
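
The following is a minimal PyTorch sketch of the two ideas named in the abstract, not the authors' implementation: a single encoder/decoder body is shared across partition layers, with hypothetical per-layer 1×1 adapters (`pre`/`post`) assumed here to reconcile the differing channel counts, and a uniform min-max quantizer whose bitwidth is chosen at run time. All names (`SharedAutoencoder`, `fake_quantize`, the layer-to-channel map) are illustrative assumptions.

```python
import torch
import torch.nn as nn


def fake_quantize(x, num_bits):
    # Uniform min-max quantization with a straight-through estimator;
    # a common stand-in for the paper's dynamic-bitwidth quantizer (assumption).
    qmax = 2 ** num_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / qmax
    u = (x - lo) / scale
    q = torch.round(u)
    q = u + (q - u).detach()  # forward: rounded values; backward: identity gradient
    return q * scale + lo


class SharedAutoencoder(nn.Module):
    """One autoencoder body whose weights are shared across all partition layers.

    Per-layer 1x1 convolutions map each layer's channel count to a common width
    (an assumed mechanism for handling the shape mismatch between layers).
    """

    def __init__(self, layer_channels, common_ch=64, bottleneck_ch=16):
        super().__init__()
        self.pre = nn.ModuleDict({name: nn.Conv2d(c, common_ch, 1)
                                  for name, c in layer_channels.items()})
        self.post = nn.ModuleDict({name: nn.Conv2d(common_ch, c, 1)
                                   for name, c in layer_channels.items()})
        # Shared encoder/decoder reused for every partition layer.
        self.encoder = nn.Sequential(
            nn.Conv2d(common_ch, bottleneck_ch, 3, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Conv2d(bottleneck_ch, common_ch, 3, padding=1), nn.ReLU())

    def forward(self, feat, layer_name, num_bits=8):
        z = self.encoder(self.pre[layer_name](feat))
        z = fake_quantize(z, num_bits)  # bitwidth selected at run time
        return self.post[layer_name](self.decoder(z))


# Usage: one model serves two partition layers with different channel counts.
ae = SharedAutoencoder({"conv3": 128, "conv4": 256})
feat = torch.randn(1, 256, 14, 14)
recon = ae(feat, "conv4", num_bits=4)  # lower bitwidth -> higher compression
assert recon.shape == feat.shape
```

Under these assumptions, lowering `num_bits` at inference time trades reconstruction fidelity for a smaller transmitted bottleneck, which is the flexibility in compression ratio the abstract refers to, while the shared `encoder`/`decoder` parameters are the source of the storage savings over per-layer independent models.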