Video localized caption generation framework for industrial videos

JOURNAL OF INTELLIGENT & FUZZY SYSTEMS（2022）

引用 0|浏览7

暂无评分

摘要

In this information age, there is exponential growth in visual content and video captioning can address many real-life applications. Automatic generation of video captions can be beneficial to comprehend a video in a short time, assist in faster information retrieval, video analysis, indexing, report generation, etc. Captioning of industrial videos is of importance to get a visual and textual summary of the work ongoing in the industry. The generated captioned summary of the video can assist in remote monitoring of industries and these captions can be utilized for video question-answering, video segment extraction, productivity analysis, etc. Due to the presence of diverse events processing of industrial videos are more challenging compared to other domains. In this paper, we address the real-life application of generating the descriptions for the videos of a labor-intensive industry. We propose a keyframe-based approach for the generation of video captions. The framework produces a video summary by extraction of keyframes, thereby reducing the video captioning task to image captioning. These keyframes are passed to the image captioning model for description generation. Utilizing these individual frame captions, multi-caption descriptions of a video are generated with a unique start and end time of each caption. For image captioning, a merge encoder-decoder model with a stacked decoder for caption generation is used. We have performed experimentation on a dataset specifically created for the small-scale industry. We have also shown that data augmentation on the small dataset can greatly benefit the generation of remarkably good video descriptions. Results of extensive experimentation performed by utilizing different image encoders, language encoders, and decoders in the merge encoder-decoder model are reported. Apart from presenting the results on domain-specific data, results on domain-independent datasets are also presented to show the applicability of the technique in general. Performance comparison with existing datasets - OVSD and Flickr8k and Flickr30k are reported to demonstrate the scalability of our method.

查看译文

关键词

Video Localized Captioning, Keyframe extraction, Video Segmentation, Image Captioning, Merge Encoder-Decoder models, Stacked Bi-LSTM Decoder, Deep Learning

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要