SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space
arXiv (2024)
Abstract
Combining face swapping with lip synchronization technology offers a
cost-effective solution for customized talking face generation. However,
directly cascading existing models together tends to introduce significant
interference between tasks and reduce video clarity because the interaction
space is limited to the low-level semantic RGB space. To address this issue, we
propose an innovative unified framework, SwapTalk, which accomplishes both face
swapping and lip synchronization tasks in the same latent space. Referring to
recent work on face generation, we choose the VQ-embedding space due to its
excellent editability and fidelity performance. To enhance the framework's
generalization capabilities for unseen identities, we incorporate identity loss
during the training of the face swapping module. Additionally, we introduce
expert discriminator supervision within the latent space during the training of
the lip synchronization module to elevate synchronization quality. In the
evaluation phase, previous studies primarily focused on the self-reconstruction
of lip movements in synchronous audio-visual videos. To better approximate
real-world applications, we expand the evaluation scope to asynchronous
audio-video scenarios. Furthermore, we introduce a novel identity consistency
metric to more comprehensively assess the identity consistency over time series
in generated facial videos. Experimental results on the HDTF dataset demonstrate that
our method significantly surpasses existing techniques in video quality, lip
synchronization accuracy, face swapping fidelity, and identity consistency. Our
demo is available at http://swaptalk.cc.