SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space


Combining face swapping with lip synchronization technology offers a cost-effective solution for customized talking face generation. However, directly cascading existing models tends to introduce significant interference between tasks and reduce video clarity, because the interaction space is limited to the low-level semantic RGB space. To address this issue, we propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization in the same latent space. Following recent work on face generation, we choose the VQ-embedding space for its excellent editability and fidelity. To enhance the framework's generalization to unseen identities, we incorporate an identity loss during the training of the face swapping module. Additionally, we introduce expert discriminator supervision within the latent space during the training of the lip synchronization module to elevate synchronization quality. In the evaluation phase, previous studies primarily focused on the self-reconstruction of lip movements in synchronous audio-visual videos. To better approximate real-world applications, we expand the evaluation scope to asynchronous audio-video scenarios. Furthermore, we introduce a novel identity consistency metric to more comprehensively assess identity consistency over time in generated facial videos. Experimental results on the public dataset HDTF demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency.


Our model is capable of transferring the facial region of a user-defined personalized avatar (source ID) onto a specified target template, while also accommodating lip shape deformations to ensure that the lip movements in the generated video are synchronized with the user-specified audio content.


The following figure illustrates the overall framework of our proposed method. The facial image is first encoded into the VQ-embedding space. Then, the face swapping module (c) and the lip-sync module (d) handle face swapping and lip synchronization, respectively. Finally, the VQ Decoder converts the output back into RGB space, producing a customized talking face video.
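As a rough sketch of the data flow just described, the pipeline can be expressed as follows. All module and function names, tensor shapes, and embedding dimensions here are illustrative placeholders, not the authors' actual API; the point is only that both editing steps operate on the VQ latent, with a single encode at the start and a single decode at the end.

```python
import numpy as np

# Illustrative stand-ins for the learned modules. Each editing step
# operates purely in the VQ-embedding (latent) space, as in the figure.
def vq_encode(image):
    # RGB frame -> latent token grid; 16x spatial compression and
    # 256-dim embeddings are assumed values for illustration.
    h, w, _ = image.shape
    return np.zeros((h // 16, w // 16, 256))

def face_swap(latent, source_id_emb):
    # Inject the source identity into the latent (placeholder transform).
    return latent + 0.0 * source_id_emb

def lip_sync(latent, audio_feat):
    # Deform the lip region to match the audio (placeholder transform).
    return latent + 0.0 * audio_feat

def vq_decode(latent):
    # Latent token grid -> RGB frame.
    h, w, _ = latent.shape
    return np.zeros((h * 16, w * 16, 3))

target_frame = np.zeros((256, 256, 3))  # one frame of the target template
src_id = np.zeros(256)                  # source identity embedding (assumed dim)
audio = np.zeros(256)                   # per-frame audio feature (assumed dim)

latent = vq_encode(target_frame)        # (16, 16, 256) token grid
latent = face_swap(latent, src_id)      # face swap happens in latent space
latent = lip_sync(latent, audio)        # lip sync happens in latent space
out_frame = vq_decode(latent)           # back to RGB, (256, 256, 3)
```

Keeping both edits in the latent space avoids the repeated encode/decode round trips that a naive RGB-space cascade would require.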


Quantitative Experiments

Quantitative comparison of our model with various baseline models on the HDTF test set under both self-driven and cross-driven settings. Our model surpasses the other baselines in image generation quality (FID and SSIM), lip synchronization accuracy (LMD and LSE-C), and face swapping fidelity and identity consistency (ID Retrieve and Consistency). After incorporating additional training data, our model's performance improves further.


Demo Videos

The comparison between our proposed SwapTalk and various vanilla cascade methods under the cross-driven setting shows that Sync-Swap falls short of Swap-Sync in lip-sync performance. Furthermore, both numerical and visual results indicate that integrating a face restoration model (as in the Sync-Swap-Restore and Swap-Restore-Sync schemes) significantly enhances image clarity (as measured by the CPBD metric), but does so at the expense of lip synchronization (LMD and LSE-C), face swapping fidelity, and video identity consistency (ID Retrieve and Consistency). Faces processed with image restoration technology often exhibit errors in facial texture, gaze, and lip detail, whereas our model performs better in these aspects.

In self-driven scenarios, although Wav2Lip achieves a higher LSE-C score, our methods (Ours Lip-Sync and SwapTalk) produce videos with lip movements that are more visually synchronized with the audio. In cross-driven settings, our models (Ours Lip-Sync and SwapTalk) perform significantly better. We infer that the Wav2Lip model may rely more on the expressions and poses of the upper face rather than the audio information for predicting lip shapes, which could explain its inferior performance in handling arbitrary speech lip-sync tasks.

We observe that the VQGAN's compression ratio significantly influences the performance of our model. The following video demonstrates the lip-sync outcomes at varying compression ratios; the 8x compression ratio introduces noticeable flickering and ghosting in the mouth region, whereas the 16x ratio achieves superior results.
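For concreteness, the latent grid sizes implied by the two compression ratios can be worked out directly. The 256x256 face-crop resolution below is an assumption for illustration (a common choice in talking-face work), not a value stated above.

```python
crop = 256  # assumed face-crop resolution (illustrative)

for ratio in (8, 16):
    side = crop // ratio          # latent grid side length
    tokens = side * side          # latent tokens per frame
    print(f"{ratio}x compression -> {side}x{side} grid, {tokens} tokens/frame")
```

Under this assumption, the 8x setting produces four times as many latent tokens per frame as the 16x setting (32x32 = 1024 vs. 16x16 = 256), which may make temporally consistent prediction in the mouth region harder.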