Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model
arXiv (2024)
Abstract
Existing work has made strides in video generation, but the lack of sound
effects (SFX) and background music (BGM) hinders a complete and immersive
viewer experience. We introduce a novel semantically consistent video-to-audio
generation framework, namely SVA, which automatically generates audio
semantically consistent with the given video content. The framework harnesses
the power of a multimodal large language model (MLLM) to understand video
semantics from a key frame and generate creative audio schemes, which are then
utilized as prompts for text-to-audio models, resulting in video-to-audio
generation with natural language as an interface. We show the satisfactory
performance of SVA through a case study and discuss its limitations along with
future research directions. The project page is available at
https://huiz-a.github.io/audio4video.github.io/.
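
The pipeline described in the abstract (key frame, then MLLM-generated audio schemes, then text-to-audio prompts) can be pictured with a minimal sketch. This is not the authors' implementation: the names `AudioScheme`, `pick_key_frame`, and `video_to_audio` are hypothetical, the middle-frame selection rule is an assumption, and the MLLM and text-to-audio models are passed in as plain callables rather than tied to any real library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence


@dataclass
class AudioScheme:
    """Hypothetical container for the text prompts an MLLM might produce."""
    sfx_prompt: str  # natural-language description of sound effects
    bgm_prompt: str  # natural-language description of background music


def pick_key_frame(frames: Sequence[Any]) -> Any:
    """Pick one representative frame.

    Assumption: the middle frame is used; the abstract does not say
    how the key frame is selected.
    """
    if not frames:
        raise ValueError("video has no frames")
    return frames[len(frames) // 2]


def video_to_audio(
    frames: Sequence[Any],
    describe_frame: Callable[[Any], AudioScheme],
    text_to_audio: Callable[[str], Any],
) -> Dict[str, Any]:
    """Key frame -> audio scheme (text) -> one generated track per prompt."""
    key_frame = pick_key_frame(frames)
    scheme = describe_frame(key_frame)  # stand-in for the MLLM call
    return {
        "sfx": text_to_audio(scheme.sfx_prompt),  # stand-in for a text-to-audio model
        "bgm": text_to_audio(scheme.bgm_prompt),
    }
```

Because natural language is the only interface between the two stages, either model could in principle be swapped without changing the other; in this sketch that corresponds to passing different callables for `describe_frame` and `text_to_audio`.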