SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space

Abstract

Combining face swapping with lip synchronization technology offers a cost-effective solution for customized talking face generation. However, directly cascading existing models tends to introduce significant interference between the two tasks and to reduce video clarity, because the models can only interact in the low-level semantic RGB space. To address this issue, we propose a unified framework, SwapTalk, which performs both face swapping and lip synchronization in the same latent space. Following recent work on face generation, we choose the VQ-embedding space for its excellent editability and fidelity. To improve the framework's generalization to unseen identities, we incorporate an identity loss when training the face swapping module. Additionally, we introduce expert discriminator supervision within the latent space when training the lip synchronization module to improve synchronization quality. For evaluation, previous studies primarily focused on the self-reconstruction of lip movements in synchronous audio-visual videos. To better approximate real-world applications, we extend the evaluation to asynchronous audio-video scenarios. Furthermore, we introduce a novel identity consistency metric that assesses more comprehensively how well identity is preserved over time in generated facial videos. Experimental results on the public HDTF dataset demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency.

Approach

Our model transfers the facial region of a user-defined personalized avatar (source ID) onto a specified target template, while also deforming the lip shape so that the lip movements in the generated video are synchronized with the user-specified audio content.

Quantitative Experiments

Quantitative comparison of our model with various baseline models on the HDTF test set in both self-driven and cross-driven settings. Our model surpasses the baselines in image generation quality (as indicated by the FID and SSIM metrics), lip synchronization accuracy (LMD or LSE-C), and face swapping fidelity (ID Retrieve and Consistency). Incorporating additional training data further improves our model's performance.
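To make two of the metrics above more concrete, the following is a minimal sketch, not the evaluation code used in our experiments. It assumes that LMD is the mean Euclidean distance between matching mouth landmarks in generated and reference frames, and that identity consistency can be approximated by the mean cosine similarity between identity embeddings of consecutive frames. The latter definition, and the function names `landmark_distance`, `cosine_similarity`, and `identity_consistency`, are illustrative assumptions; landmark detection and identity embedding extraction are assumed to happen elsewhere.

```python
import math


def landmark_distance(pred, ref):
    """Mean Euclidean distance between matching 2D landmarks.

    pred, ref: sequences of frames, each frame a sequence of (x, y) points.
    Lower is better. Assumed form of LMD, for illustration only.
    """
    total, count = 0.0, 0
    for frame_pred, frame_ref in zip(pred, ref):
        for (px, py), (rx, ry) in zip(frame_pred, frame_ref):
            total += math.hypot(px - rx, py - ry)
            count += 1
    if count == 0:
        raise ValueError("no landmarks to compare")
    return total / count


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_consistency(embeddings):
    """Mean cosine similarity between consecutive frame embeddings.

    Hypothetical stand-in for an identity consistency score over time;
    higher means the identity is more stable across frames.
    """
    sims = [cosine_similarity(a, b) for a, b in zip(embeddings, embeddings[1:])]
    if not sims:
        raise ValueError("need at least two frame embeddings")
    return sum(sims) / len(sims)
```

For example, `landmark_distance([[(0, 0), (1, 1)]], [[(3, 4), (1, 1)]])` averages one distance of 5 and one of 0, and `identity_consistency` on a video whose embeddings never change yields a score of 1.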

Demo Videos

A comparison between our proposed SwapTalk and various vanilla cascade methods under the cross-driven setting shows that Sync-Swap falls short of Swap-Sync in lip synchronization performance. Furthermore, both numerical and visual results indicate that integrating a face restoration model (as in the Sync-Swap-Restore and Swap-Restore-Sync schemes) significantly enhances image clarity (as measured by the CPBD metric), but at the expense of lip synchronization (LMD and LSE-C), face swapping fidelity, and video identity consistency (ID Retrieve and Consistency). Faces processed with image restoration often exhibit errors in facial texture, gaze, and lip detail, whereas our model performs better in these respects.