Unlike the field of image generation, where many studies have succeeded in generating high-resolution, high-fidelity realistic images, video generation with unconditional GANs remains a challenging problem (Saito et al., 2018). One reason videos may be harder than images is that they require larger memory and computational costs than static images (ibid.), and therefore involve increased data complexity (Clark et al., 2019).
Recently, an article by DeepMind (Clark et al., 2019) introduced the Dual Video Discriminator GAN (DVD-GAN), which scales to longer and higher-resolution videos. It beat previous attempts on several performance metrics for synthesis on the Kinetics-600 dataset.
DVD-GAN synthesised video at 12 fps and a resolution of 256 × 256, achieving a Fréchet Inception Distance of 3.35 (a metric of the similarity between the distributions of generated and real samples; lower is better) and an Inception Score of 64.05 (a metric of sample quality and diversity modelled on the judgement of human annotators; higher is better). However, the videos are very short: at most 48 frames, which amounts to only 2 seconds of video at 24 fps.
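To make the two metrics concrete, here is a minimal sketch of the underlying formulas. The function and variable names are illustrative, not from the DVD-GAN paper: FID is the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated samples, and IS is the exponentiated mean KL divergence between per-sample class predictions and their marginal.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2).

    For FID, mu/sigma are the mean and covariance of Inception features
    over real vs. generated samples; lower is better (0 = identical).
    """
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        # sqrtm can return tiny imaginary components from numerical noise
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def inception_score(preds, eps=1e-12):
    """Inception Score from an (N, C) array of class probabilities.

    preds holds a classifier's predicted class distribution for each of
    N generated samples; higher is better (max is C, the class count).
    """
    marginal = preds.mean(axis=0, keepdims=True)
    kl = np.sum(preds * (np.log(preds + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

As a sanity check, identical feature statistics give a Fréchet distance of 0, and perfectly confident, perfectly diverse predictions (e.g. one-hot rows covering all C classes equally) give an Inception Score of C.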
When will a generative model produce a video of at least 2880 frames, at a resolution of 256 × 256 or better, with a reported Fréchet Inception Distance of less than 0.100 or an Inception Score of greater than 500.00?
This question resolves on the date when such a model is first reported in a preprint or a peer-reviewed journal.