Video Index Point Detection and Extraction Framework using Custom YoloV4 Darknet Object Detection Model

The trend of learning from videos rather than documents has grown. Hundreds or even thousands of videos may exist on a single topic, with varying degrees of context, content, and depth. The literature reports that learners are nowadays less interested in viewing a complete video and instead prefer to jump to the topics of their interest. This creates the need for indexing video lectures. Manual annotation, or topic-wise indexing, is not new for videos; however, manual indexing is time-consuming given the length of a typical video lecture and its intricate storyline. Automatic indexing and annotation are therefore better and more efficient solutions. This research addresses the need for automatic video indexing for better information retrieval and for easing users’ navigation to topics inside a video. The automatically identified topics are referred to as “Index Points.” A 137-layer YoloV4 Darknet neural network is used to build a custom object detection model. The model is trained on approximately 6,000 video frames and then tested on a suite of 50 videos with around 20 hours of total running time. Shot boundary detection is performed using Structural Similarity (SSIM) fused with a binary search pattern algorithm, which outperformed the state-of-the-art SSIM technique, reducing processing time to approximately 21% while providing around 96% accuracy. The accuracy of the generated index points, in terms of true positives and false negatives, is evaluated through precision, recall, and F1 score, which range between 60% and 80% for every video. The results show that the proposed algorithm successfully generates a digital index with reasonable accuracy in topic detection.
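
To make the detection stage concrete, the following is a minimal sketch, not the authors' code, of running a custom-trained YoloV4 Darknet model on an extracted video frame via OpenCV's DNN module. The configuration/weight file names, the frame path, the 416x416 input size, and the thresholds are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the authors' implementation) of
# inference with a custom YOLOv4 Darknet detector via OpenCV's DNN module.
import cv2

# Load the Darknet config and the custom-trained weights (file names assumed).
net = cv2.dnn.readNetFromDarknet("yolov4-custom.cfg", "yolov4-custom.weights")
model = cv2.dnn_DetectionModel(net)
# YOLOv4 expects RGB input scaled to [0, 1]; 416x416 is a common input size.
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

frame = cv2.imread("frame_000123.png")  # a frame extracted from a lecture video
class_ids, scores, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)
for cid, score, box in zip(class_ids, scores, boxes):
    print(cid, float(score), box)  # box = (x, y, w, h) in pixels
```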
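
The shot-boundary stage can likewise be sketched. The idea below, a simplification under stated assumptions rather than the paper's exact algorithm, is SSIM-based boundary detection accelerated with a binary-search-style interval split: if the endpoint frames of an interval are already similar, the whole interval is skipped without decoding or comparing the frames in between. The similarity threshold and the assumption that a skipped interval contains no cut are illustrative.

```python
# Sketch of SSIM shot-boundary detection with binary-search interval
# splitting. Assumes OpenCV for frame access and scikit-image for SSIM;
# the threshold value is an illustrative placeholder.
import cv2
from skimage.metrics import structural_similarity as ssim

def read_gray(cap, idx):
    """Seek to frame idx and return it as a grayscale array (or None)."""
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
    ok, frame = cap.read()
    return cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) if ok else None

def find_boundaries(cap, lo, hi, threshold=0.7, boundaries=None):
    """Recursively bisect [lo, hi]. Intervals whose endpoint frames are
    similar (SSIM >= threshold) are skipped entirely, which avoids the
    frame-by-frame comparisons of exhaustive SSIM."""
    if boundaries is None:
        boundaries = []
    a, b = read_gray(cap, lo), read_gray(cap, hi)
    if a is None or b is None or ssim(a, b) >= threshold:
        return boundaries  # endpoints similar: assume no cut inside
    if hi - lo <= 1:
        boundaries.append(hi)  # adjacent dissimilar frames: a shot cut
        return boundaries
    mid = (lo + hi) // 2
    find_boundaries(cap, lo, mid, threshold, boundaries)
    find_boundaries(cap, mid, hi, threshold, boundaries)
    return boundaries

cap = cv2.VideoCapture("lecture.mp4")  # hypothetical input video
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
print(find_boundaries(cap, 0, total - 1))
```

Skipping intervals with similar endpoints trades a little recall (a cut that returns to visually similar content can be missed, consistent with accuracy below 100%) for a large reduction in SSIM computations, which is plausibly the source of the reported speed-up.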
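
For reference, the reported precision, recall, and F1 scores follow the standard definitions, where TP, FP, and FN count correctly detected, spurious, and missed index points, respectively:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```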