Seeing Bot

We demonstrate a video captioning bot, named Seeing Bot, which generates a natural language description of what it is seeing in near real time. Given a live streaming video, Seeing Bot runs two pre-learned, complementary captioning modules in parallel: one generates an image-level caption for each sampled frame, and the other generates a video-level caption for each sampled clip. Both modules are boosted by incorporating semantic attributes, which enrich the generated descriptions and lead to human-level caption generation. A visual-semantic embedding model then ranks the candidates from the two modules and selects the final caption according to the semantic relevance between the video content and each generated caption. Finally, Seeing Bot converts the selected description to speech and delivers it to the end user through an earphone. Our demonstration runs on arbitrary videos in the wild and supports live video captioning.
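To make the pipeline concrete, the following is a minimal Python sketch of the frame-and-clip captioning flow under stated assumptions: the module names caption_frame, caption_clip, embed_visual, embed_text, and text_to_speech are hypothetical placeholders for the pre-learned components (none of these names come from the system itself), and cosine similarity stands in for the learned relevance measure in the joint embedding space.

    # Minimal sketch of the Seeing Bot pipeline; all injected callables are
    # hypothetical stand-ins for the pre-learned modules described above.
    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    def cosine(u, v):
        # Cosine similarity between two embedding vectors.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def describe(frame, clip, caption_frame, caption_clip,
                 embed_visual, embed_text, text_to_speech):
        # Run the two complementary captioning modules in parallel: an
        # image-level caption for the sampled frame and a video-level
        # caption for the sampled clip.
        with ThreadPoolExecutor(max_workers=2) as pool:
            image_future = pool.submit(caption_frame, frame)
            video_future = pool.submit(caption_clip, clip)
            candidates = [image_future.result(), video_future.result()]

        # Rank the candidates by semantic relevance between the video
        # content and each caption in the joint visual-semantic space.
        video_vec = embed_visual(clip)
        best = max(candidates, key=lambda c: cosine(video_vec, embed_text(c)))

        # Convert the selected description to speech for the end user.
        return text_to_speech(best)

In a live setting, this function would be invoked on each (frame, clip) pair sampled from the stream, so the parallel execution of the two modules is what keeps the end-to-end latency near real time.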
