Summarizing Lecture Videos by Key Handwritten Content Regions

We introduce a novel method for summarizing whiteboard lecture videos using key handwritten content regions. A deep neural network detects bounding boxes that contain semantically meaningful groups of handwritten content. A neural network embedding is learned from the detected regions under triplet loss in order to discriminate between unique handwritten content. The detected regions and their embeddings at every frame of the lecture video are used to extract the unique handwritten content across the video, which is presented as the video summary. Additionally, a spatiotemporal index is constructed from the video; it records the time and location of each summary region and can potentially be used for content-based search and navigation. We train and test our methods on the publicly available AccessMath dataset. We use the DetEval scheme to benchmark our summarization by recall of unique ground truth objects (92.09%) and by the average number of summary regions (128) compared to the ground truth (88).
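As a rough illustration of the triplet loss mentioned above, the following minimal sketch computes the loss for a single triplet of embedding vectors. It assumes the commonly used squared Euclidean distance and a hypothetical margin of 0.2; the abstract does not state the distance function, the margin, or the network used, so the function name and all values here are illustrative only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (an assumed value, not taken from the paper).
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings: the anchor and positive would come from the
# same handwritten content, the negative from different content.
anchor = [0.0, 0.0]
same_content = [0.0, 0.0]
other_content = [1.0, 0.0]

well_separated = triplet_loss(anchor, same_content, other_content)
violating = triplet_loss(anchor, other_content, same_content)
```

In this toy case the first triplet already satisfies the margin, so it contributes no loss, while the second (with positive and negative swapped) is penalized. Training on many such triplets pushes embeddings of the same content together and embeddings of different content apart.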
