Abstract
This project investigates the challenging problem of audio-to-video generation using representation learning. Audio-to-video generation has broad applications across several industries. We propose a novel training pipeline consisting of pre-trained models (StyleGAN3, Wav2Vec2, and MTCNN), newly trained models (variational autoencoders and transformers), and an adversarial learning algorithm. To the best of the author’s knowledge, this is the first implementation of audio-to-video generation using a pre-trained StyleGAN3. The inputs are a speech audio sequence and an image of a face. Our model learns to “animate” the face by predicting facial expressions and lip movements. We find that the latent code of our generative model can be compressed 16-fold into a 96-dimensional vector that retains the information of the talking face. With this approach, audio-to-video generation can be achieved without training any generative model; only the latent codes need to be predicted from audio. This minimizes the dataset size and training time required. (The reconstructed videos can be found here.)