A spatio-temporal deep architecture for surveillance event detection based on ConvLSTM

Accurate event detection in surveillance videos is one of the most challenging tasks in computer vision because unwanted events produce enormous noise. In this paper, we propose a method that concentrates on the target event by detecting a person's key pose while incorporating the temporal information that describes how the key pose changes over time. Specifically, we propose a recurrent model based on ConvLSTM integrated with temporal pooling (CLITP) to capture temporal representations as well as spatial features. In addition, our model can handle variable-length sequences and works well on small datasets. We conduct experiments on two canonical surveillance event detection datasets: the TRECVID SED dataset and the multiple cameras fall dataset. By combining spatial and temporal information, our method achieves very competitive results compared with state-of-the-art methods.
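
To make the idea of ConvLSTM with temporal pooling concrete, the following is a minimal sketch using the commonly used ConvLSTM formulation without peephole connections. It is not necessarily the exact CLITP configuration. Here $X_t$ is the input feature map at frame $t$, $H_t$ and $C_t$ are the hidden and cell states, $*$ denotes convolution, and $\odot$ the element-wise product. The pooling operator over time (mean pooling below) is an assumption made for illustration; max pooling would serve the same purpose.

```latex
\begin{align}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
H_t &= o_t \odot \tanh\left(C_t\right) \\
\bar{H} &= \frac{1}{T} \sum_{t=1}^{T} H_t
\end{align}
```

Because $\bar{H}$ has the same shape for any sequence length $T$, a classifier applied to $\bar{H}$ can accept sequences of variable length, which is consistent with the property claimed above.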