Binary Representation and High Efficient Compression of 3D CNN Features for Action Recognition

A common framework for action recognition first collects the videos from different cameras at a cloud center and then runs a 3D CNN on the cloud server. Although straightforward, this framework places a heavy burden on both the cloud server and video transmission. To address this challenge, a "front-cloud" collaborative processing architecture can be used. The most important issue is then to compress the 3D CNN features effectively without significant loss of accuracy. We propose logarithmic quantization with a maximum value threshold, followed by HEVC inter coding, for 3D CNN features. Experimental results on ResNet-50 and InceptionV1 show that the features can be represented with only 1 bit without significant loss of accuracy. The compression ratio of the quantized 1-bit features using HEVC inter coding can reach 5000 times, with a loss of accuracy of less than 1%.
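
As an illustration of the quantization step, the following is a minimal sketch of logarithmic quantization with a maximum value threshold. The abstract does not give the exact formula, so everything here is an assumption: the function names `log_quantize` and `log_dequantize`, the parameters `max_value` and `bits`, the use of `log(1 + x)` as the logarithmic mapping, and the premise that features are non-negative (for example, taken after a ReLU). The HEVC inter coding stage is not shown.

```python
import math


def log_quantize(features, max_value, bits=1):
    """Map non-negative feature values to integer codes in [0, 2**bits - 1].

    Assumed scheme (not taken from the paper): clip each value to
    [0, max_value], apply log(1 + x), then quantize uniformly in the
    log domain.
    """
    if max_value <= 0 or bits < 1:
        raise ValueError("max_value must be positive and bits must be >= 1")
    levels = (1 << bits) - 1                 # highest integer code
    scale = levels / math.log1p(max_value)   # log-domain value -> code
    codes = []
    for x in features:
        clipped = min(max(x, 0.0), max_value)  # apply the maximum value threshold
        codes.append(int(round(math.log1p(clipped) * scale)))
    return codes


def log_dequantize(codes, max_value, bits=1):
    """Approximate inverse of log_quantize: integer codes back to feature values."""
    levels = (1 << bits) - 1
    step = math.log1p(max_value) / levels    # log-domain width of one code
    return [math.expm1(q * step) for q in codes]


# Hypothetical feature values; with bits=1 every value becomes 0 or 1.
binary_codes = log_quantize([0.0, 0.5, 3.0, 10.0], max_value=3.0, bits=1)
restored = log_dequantize(binary_codes, max_value=3.0, bits=1)
```

Under this assumed formula, the 1-bit case reduces to a single threshold in the log domain: values well below `sqrt(1 + max_value) - 1` map to 0 and values well above it map to 1. The resulting binary feature maps are the kind of data that would then be passed to a video encoder.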