Resource Sharing and Security Implications on Machine Learning Inference Accelerators

Due to the increasing adoption of Machine Learning (ML), and in particular Deep Learning (DL), many specialized energy-efficient accelerators are being proposed by academia and industry. A number of these accelerators are designed to run a single application at a time in exclusive access mode. This approach gives applications maximum performance but reduces resource efficiency, resulting in increased costs over time. Sharing the device among multiple jobs increases resource utilization and amplifies the return on investment. This study presents a broad investigation of spatial resource-sharing strategies for machine learning hardware accelerators and evaluates their performance on PUMA [1], a novel memristor-based accelerator. Two methods of spatial sharing are discussed: Model Packing and Logical Allocation. Simulations show that both methods can be implemented on the PUMA accelerator and increase resource utilization. The former achieves a higher degree of parallelism, fitting more models per device (7 models on 11 tiles), but incurs higher interference overhead (up to 49%), which in most cases is still lower than the overhead observed for GPUs. The latter achieves better isolation, with almost no interference overhead (<1%), at the cost of leaving resources unused (the same 7 models consumed 16 tiles). Finally, we discuss the security implications of resource sharing for ML and related concerns, presenting a novel ML model integrity check and a model bias verification method.
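To illustrate the contrast between the two policies, the minimal Python sketch below models a tile-based accelerator and two toy allocators. All names (`Tile`, `model_packing`, `logical_allocation`), the tile capacity, and the workload sizes are illustrative assumptions for exposition only; they are not part of PUMA's toolchain and do not reproduce the paper's measured configuration.

```python
# Hypothetical sketch: Model Packing vs. Logical Allocation on a tile-based
# accelerator. Tile capacity, core demands, and the greedy policies below
# are assumptions for illustration, not PUMA's actual allocator.
from dataclasses import dataclass, field
from typing import Dict, List

CORES_PER_TILE = 8  # assumed number of compute cores per tile


@dataclass
class Tile:
    tid: int
    owners: List[str] = field(default_factory=list)  # model name per occupied core

    @property
    def free(self) -> int:
        return CORES_PER_TILE - len(self.owners)


def model_packing(models: Dict[str, int], num_tiles: int) -> List[Tile]:
    """Greedy packing: a tile may host cores from several models
    (higher utilization, but co-located models can interfere)."""
    tiles = [Tile(i) for i in range(num_tiles)]
    for name, cores in models.items():
        for tile in tiles:
            take = min(tile.free, cores)
            tile.owners.extend([name] * take)
            cores -= take
            if cores == 0:
                break
        else:
            raise RuntimeError(f"not enough tiles for {name}")
    return [t for t in tiles if t.owners]


def logical_allocation(models: Dict[str, int], num_tiles: int) -> List[Tile]:
    """Logical allocation: each model receives whole tiles to itself
    (better isolation, but leftover cores on its last tile stay idle)."""
    tiles = iter(Tile(i) for i in range(num_tiles))
    used: List[Tile] = []
    for name, cores in models.items():
        while cores > 0:
            tile = next(tiles, None)
            if tile is None:
                raise RuntimeError(f"not enough tiles for {name}")
            take = min(CORES_PER_TILE, cores)
            tile.owners.extend([name] * take)
            cores -= take
            used.append(tile)
    return used


if __name__ == "__main__":
    # Illustrative workload: 7 small models with assumed core demands.
    workload = {f"model{i}": d for i, d in enumerate([12, 10, 9, 14, 6, 20, 13])}
    packed = model_packing(workload, num_tiles=32)
    logical = logical_allocation(workload, num_tiles=32)
    print(f"Model Packing uses {len(packed)} tiles")        # fewer tiles, shared
    print(f"Logical Allocation uses {len(logical)} tiles")  # more tiles, isolated
```

Running the sketch shows the trade-off in miniature: packing fills partially occupied tiles and therefore needs fewer of them, while logical allocation never mixes models on a tile and pays for that isolation with unused cores.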