论文信息 - Empirical autopsy of deep video captioning encoder-decoder architecture - 字舞流文

Empirical autopsy of deep video captioning encoder-decoder architecture

Ajmal Mian | Naveed Akhtar | Nayyer Aafaq | Wei Liu | A. Mian | Naveed Akhtar | Wei Liu | Nayyer Aafaq

[1] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[2] Yongdong Zhang,et al. STAT: Spatial-Temporal Attention Mechanism for Video Captioning , 2020, IEEE Transactions on Multimedia.

[3] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Jeffrey Pennington,et al. GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[5] Wei Xu,et al. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Tao Mei,et al. Jointly Modeling Embedding and Translation to Bridge Video and Language , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[8] Jeffrey Dean,et al. Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[9] Yuxin Peng,et al. Object-Aware Aggregation With Bidirectional Temporal Graph for Video Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Xuelong Li,et al. MAM-RNN: Multi-level Attention Model Based RNN for Video Captioning , 2017, IJCAI.

[11] Lei Guo,et al. When Deep Learning Meets Metric Learning: Remote Sensing Image Scene Classification via Learning Discriminative CNNs , 2018, IEEE Transactions on Geoscience and Remote Sensing.

[12] Xueming Qian,et al. Semantic Annotation of High-Resolution Satellite Images via Weakly Supervised Learning , 2016, IEEE Transactions on Geoscience and Remote Sensing.

[13] Sheng Liu,et al. SibNet: Sibling Convolutional Encoder for Video Captioning , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14] Tieniu Tan,et al. M3: Multimodal Memory Modelling for Video Captioning , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15] Wei Liu,et al. Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Qingming Huang,et al. Less Is More: Picking Informative Frames for Video Captioning , 2018, ECCV.

[17] Tomas Mikolov,et al. Enriching Word Vectors with Subword Information , 2016, TACL.

[18] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[19] Kunio Fukunaga,et al. Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions , 2002, International Journal of Computer Vision.

[20] Dacheng Tao,et al. Syntax-Aware Action Targeting for Video Captioning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[22] Trevor Darrell,et al. Identity-Aware Multi-Sentence Video Description , 2020, ECCV.

[23] Kate Saenko,et al. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge , 2013, AAAI.

[24] Tao Mei,et al. Video Captioning with Transferred Semantic Attributes , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Zhe Gan,et al. Semantic Compositional Networks for Visual Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[27] William B. Dolan,et al. Collecting Highly Parallel Data for Paraphrase Evaluation , 2011, ACL.

[28] Juan Carlos Niebles,et al. Dense-Captioning Events in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29] Feiping Nie,et al. Revisiting Co-Saliency Detection: A Novel Approach Based on Two-Stage Multi-View Spectral Rotation Co-clustering , 2017, IEEE Transactions on Image Processing.

[30] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[31] Yu-Wing Tai,et al. Memory-Attended Recurrent Network for Video Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Wojciech Zaremba,et al. An Empirical Exploration of Recurrent Network Architectures , 2015, ICML.

[33] Xin Wang,et al. Video Captioning via Hierarchical Reinforcement Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34] Alon Lavie,et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[35] Jeffrey L. Elman,et al. Finding Structure in Time , 1990, Cogn. Sci..

[36] Sergey Ioffe,et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[37] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[38] Wei Liu,et al. Reconstruction Network for Video Captioning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Fei-Fei Li,et al. Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[41] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Chenliang Xu,et al. A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[43] Stephen E. Robertson,et al. Understanding inverse document frequency: on theoretical arguments for IDF , 2004, J. Documentation.

[44] Christopher Joseph Pal,et al. Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).