BSN: Boundary Sensitive Network for Temporal Action Proposal Generation

Temporal action proposal generation is an important yet challenging problem, since temporal proposals with rich action content are indispensable for analysing real-world videos with long duration and high proportion irrelevant content. This problem requires methods not only generating proposals with precise temporal boundaries, but also retrieving proposals to cover truth action instances with high recall and high overlap using relatively fewer proposals. To address these difficulties, we introduce an effective proposal generation method, named Boundary-Sensitive Network (BSN), which adopts "local to global" fashion. Locally, BSN first locates temporal boundaries with high probabilities, then directly combines these boundaries as proposals. Globally, with Boundary-Sensitive Proposal feature, BSN retrieves proposals by evaluating the confidence of whether a proposal contains an action within its region. We conduct experiments on two challenging datasets: ActivityNet-1.3 and THUMOS14, where BSN outperforms other state-of-the-art temporal action proposal generation methods with high recall and high temporal precision. Finally, further experiments demonstrate that by combining existing action classifiers, our method significantly improves the state-of-the-art temporal action detection performance.

[1]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[2]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  R. Nevatia,et al.  TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[5]  Bernard Ghanem,et al.  End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos , 2017, BMVC.

[6]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[7]  Larry S. Davis,et al.  Temporal Context Network for Activity Localization in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[8]  Kate Saenko,et al.  R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[9]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Bernard Ghanem,et al.  ActivityNet Challenge 2017 Summary , 2017, ArXiv.

[11]  Xiaoou Tang,et al.  Action Recognition and Detection by Combining Motion and Appearance Features , 2014 .

[12]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Shih-Fu Chang,et al.  Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Tong Lu,et al.  Temporal Action Localization by Structured Maximal Sums , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[16]  Wei Li,et al.  CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016 , 2016, ArXiv.

[17]  Xu Zhao,et al.  Temporal Convolution Based Action Proposal: Submission to ActivityNet 2017 , 2017, ArXiv.

[18]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[19]  Nikos Komodakis,et al.  LocNet: Improving Localization Accuracy for Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Xu Zhao,et al.  Single Shot Temporal Action Detection , 2017, ACM Multimedia.

[22]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Jitendra Malik,et al.  DeepBox: Learning Objectness with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[24]  Bernard Ghanem,et al.  SCC: Semantic Context Cascade for Efficient Action Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[26]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[27]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[28]  Ramakant Nevatia,et al.  Cascaded Boundary Regression for Temporal Action Detection , 2017, BMVC.

[29]  Bernard Ghanem,et al.  Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Larry S. Davis,et al.  Soft-NMS — Improving Object Detection with One Line of Code , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[31]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[32]  Cordelia Schmid,et al.  The LEAR submission at Thumos 2014 , 2014 .

[33]  Luc Van Gool,et al.  UntrimmedNets for Weakly Supervised Action Recognition and Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[35]  Shih-Fu Chang,et al.  ConvNet Architecture Search for Spatiotemporal Feature Learning , 2017, ArXiv.

[36]  Zhe Wang,et al.  Towards Good Practices for Very Deep Two-Stream ConvNets , 2015, ArXiv.

[37]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[38]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[39]  Bernard Ghanem,et al.  DAPs: Deep Action Proposals for Action Understanding , 2016, ECCV.

[40]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[41]  Bernard Ghanem,et al.  SST: Single-Stream Temporal Action Proposals , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Limin Wang,et al.  A Pursuit of Temporal Accuracy in General Activity Detection , 2017, ArXiv.

[43]  Fabio Cuzzolin,et al.  Untrimmed Video Classification for Activity Detection: submission to ActivityNet Challenge , 2016, ArXiv.

[44]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[45]  Shih-Fu Chang,et al.  CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Limin Wang,et al.  Temporal Action Detection with Structured Segment Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).