High Resolution Video Generation using Spatio-Temporal GAN


In this paper, we present a novel network for high resolution video generation. Our network borrows from Wasserstein GANs by enforcing a k-Lipschitz constraint on the loss term, and from Conditional GANs by using class labels for training and testing. We present the layerwise details of the Generator and Discriminator networks, along with the combined network architecture, the optimization details and the algorithm used in this work. Our network uses a combination of two loss terms: a mean square pixel loss and an adversarial loss. The datasets used for training and testing our network are the UCF101, Golf and Aeroplane datasets. Using Inception Score and Fréchet Inception Distance as the evaluation metrics, our network outperforms previous state-of-the-art networks on unsupervised video generation.


Deep learning approaches to computer vision problems have mostly been based on static images. However, most real-world data are dynamic in nature and contain an additional time dimension that connects the images or individual frames together. Because of these temporal dynamics, more information about the scene can be extracted. The challenge with video data is the additional computational burden and the inherent complexity introduced by the time component. Static image based algorithms, however, are not suitable for action prediction problems (Huang et al., 2018). Hence, action prediction problems require video-based algorithms.

Generating video from latent vectors with neural networks is a challenging problem. Earlier state-of-the-art methods produced blurry results, which shows the complexity of the problem (Vondrick et al., 2016). It is important to understand how pixels change between frames and to model the uncertainty involved, as shown in (Villegas et al., 2017). For video data, temporal dynamics need to be modelled separately from spatial dynamics: spatial dynamics are used to infer the objects present in the scene, whereas the movement of these objects is inferred from temporal dynamics. To address this, 1-D convolutions have been used for the temporal generator (Saito et al., 2017), and Recurrent Neural Networks (RNNs) have been used to generate latent codes for image-based generators (Tulyakov et al., 2018). Using 1-D convolutions reduces the computational burden; however, 3-D convolutions should be used for more accurate frame generation.

However, all of these previous works tackle very specific problems, and almost all of the architectures in the literature work only for specialized settings, which makes generalization to other similar tasks difficult. Our work presents a novel unsupervised GAN-based architecture for video generation and prediction that can be generalized to other settings.

Important Points

* We propose a GAN technique for unsupervised video generation at a resolution of 256×256.

* We present the architecture details of our network, along with the optimization and loss functions used.

* We validate our network on the publicly available UCF101, Golf and Aeroplane datasets for both qualitative and quantitative comparison.

* Our network beats the previous state-of-the-art methods in this domain, with Inception Score and Fréchet Inception Distance as the evaluation metrics.


The following datasets were used in this work for training and testing our network for video generation:

  1. UCF101 Dataset: This dataset was designed for training networks that are robust on action recognition tasks. It contains 13320 videos from 101 different action categories, such as Sky Diving, Knitting and Baseball Pitch (Soomro et al., 2012).
  2. Golf and Aeroplane Datasets: These contain 128×128 resolution frames that can be used for evaluating video generative adversarial networks (Vondrick et al., 2016) and (Kratzwald et al., 2017).

Network Architecture

Let the input sequence frames of a video be denoted by (X = X1, …, Xm) and the frames to be predicted in sequence by (Y = Y1, …, Yn). Our network for video generation has two stages. The first is a new conditional generative adversarial network (GAN) that generates sequences performing a given category of actions. The second is a reconstruction network with a new loss function that transfers the sequences to pixel space.
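To make the conditioning on an action category concrete, here is a minimal sketch of one common way to build the input of a conditional GAN: a Gaussian noise vector concatenated with a one-hot class label. The source does not say how the label is injected, so the concatenation, the noise dimension of 100 and the helper names `one_hot` and `conditional_latent` are all assumptions made for illustration. Only the 101 classes come from UCF101.

```python
import random


def one_hot(label, num_classes):
    """Return a one-hot list encoding `label` among `num_classes` classes."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def conditional_latent(label, num_classes, noise_dim, rng):
    """Concatenate Gaussian noise with a one-hot class label.

    Assumption: conditioning by concatenation; the paper may inject the
    label differently.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + one_hot(label, num_classes)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# 101 classes as in UCF101; noise_dim=100 is a hypothetical choice.
z = conditional_latent(label=3, num_classes=101, noise_dim=100, rng=rng)
print(len(z))  # 100 noise entries + 101 label entries
```

The same label would typically also be given to the Discriminator so that it judges whether a clip is realistic for that category.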

The input sequence frames, in the form of a noise vector, are fed to the Generator. The Generator produces output frames corresponding to the input frames. The output frame sequence is then passed to the Discriminator, which decides whether the generated frames are real or fake. Both the Generator and the Discriminator are trained using mini-batch Stochastic Gradient Descent (SGD) with their corresponding loss functions.
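The two loss terms named in this work, a mean square pixel loss and an adversarial loss, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the adversarial term uses the Wasserstein critic formulation, the Lipschitz constraint is enforced by weight clipping as in the original Wasserstein GAN paper, and `pixel_weight` and the clipping bound `c` are hypothetical hyperparameters. The source does not give these details.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def mse_pixel_loss(generated, target):
    """Mean square error between two flattened frame sequences."""
    if len(generated) != len(target):
        raise ValueError("generated and target must have the same length")
    return mean([(g - t) ** 2 for g, t in zip(generated, target)])


def critic_loss(real_scores, fake_scores):
    """Wasserstein critic loss: the critic minimizes D(fake) - D(real)."""
    return mean(fake_scores) - mean(real_scores)


def generator_loss(fake_scores, generated, target, pixel_weight):
    """Adversarial term plus weighted pixel loss (weighting is an assumption)."""
    adversarial = -mean(fake_scores)
    return adversarial + pixel_weight * mse_pixel_loss(generated, target)


def clip_weights(weights, c):
    """Clamp each weight to [-c, c]; one way to bound the Lipschitz constant."""
    return [max(-c, min(c, w)) for w in weights]


# Toy numbers, chosen only to exercise the functions.
d_loss = critic_loss(real_scores=[1.0, 3.0], fake_scores=[0.0, 1.0])
g_loss = generator_loss(
    fake_scores=[0.0, 1.0],
    generated=[0.5, 0.5],
    target=[1.0, 0.0],
    pixel_weight=10.0,
)
clipped = clip_weights([-0.5, 0.005, 0.02], c=0.01)
```

In a real training loop these quantities would be computed on mini-batches of video tensors, and the clipping would be applied to the Discriminator parameters after each update.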

3D deconvolutional layers are used in the Generator and 3D convolutional layers in the Discriminator. Batch normalization is used in the Generator and instance normalization in the Discriminator. ReLU activations are used as the non-linearity in the Generator and leaky ReLU activations in the Discriminator. The network architecture used in this work is shown in Figure 1:

Figure 1: Network architecture used in this work
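As a rough illustration of how a stack of deconvolutional layers can reach a 256×256 output, the sketch below applies the standard output-size formulas for transposed and ordinary convolutions along one spatial axis. The layer settings (kernel 4, stride 2, padding 1, six layers, a 4×4 starting size) are hypothetical and are not taken from the paper. The same arithmetic applies to the temporal axis of a 3D layer.

```python
def deconv_out(size, kernel, stride, padding):
    """Output size of a transposed convolution along one axis.

    Standard formula, assuming dilation 1 and no output padding.
    """
    return (size - 1) * stride - 2 * padding + kernel


def conv_out(size, kernel, stride, padding):
    """Output size of an ordinary convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical generator: start at 4x4 and double the size six times.
size = 4
generator_sizes = [size]
for _ in range(6):
    size = deconv_out(size, kernel=4, stride=2, padding=1)
    generator_sizes.append(size)
print(generator_sizes)

# Hypothetical discriminator: the mirrored convolution halves the size.
first_disc_size = conv_out(256, kernel=4, stride=2, padding=1)
print(first_disc_size)
```

With these settings each deconvolution doubles the spatial size, and each mirrored convolution in the Discriminator halves it.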

Evaluation Metrics

Many metrics have been proposed in the literature for evaluating GANs. Two of the most common are Inception Score and Fréchet Inception Distance, which are explained below:

  1. Inception Score (IS): Inception Score was first proposed in (Salimans et al., 2016) for evaluating GANs. A higher Inception Score is preferred, since it indicates that the model generates diverse images and avoids the mode collapse issue.
  2. Fréchet Inception Distance (FID): Another metric for evaluating the quality of generated samples, first proposed by (Heusel et al., 2017). A lower FID is preferred.
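Both metrics can be sketched in a few lines of plain Python. Inception Score is the exponential of the average KL divergence between each sample's class distribution and the marginal class distribution. FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated samples. The sketch below assumes the class probabilities and feature statistics have already been extracted by an Inception network, and the FID helper assumes diagonal covariances to avoid a matrix square root. That simplification is made for illustration only and is not how FID is normally computed.

```python
import math


def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)) over samples.

    `probs` is a list of per-sample class probability lists.
    """
    n = len(probs)
    k = len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_sum = 0.0
    for p in probs:
        # Terms with p_j == 0 contribute nothing to the KL divergence.
        kl_sum += sum(
            pj * math.log(pj / mj) for pj, mj in zip(p, marginal) if pj > 0
        )
    return math.exp(kl_sum / n)


def fid_diagonal(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariances.

    Assumption: diagonal covariances, so the trace term reduces to
    sum(v1 + v2 - 2 * sqrt(v1 * v2)).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(
        v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2)
    )
    return mean_term + cov_term


# Confident and diverse predictions give a high score.
sharp = inception_score([[1.0, 0.0], [0.0, 1.0]])
# Uninformative predictions give the minimum score.
flat = inception_score([[0.5, 0.5], [0.5, 0.5]])
# Same variances, shifted means.
shifted = fid_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 1.0])
```

With two classes the score ranges from 1 (uninformative predictions) to 2 (confident and evenly spread predictions), which is why a higher value is read as better.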


Linear interpolation in latent space to generate samples from the Golf dataset is shown in Figure 2:

Figure 2: Linear interpolation in latent space to generate samples from Golf dataset

Linear interpolation in latent space to generate samples from the Aeroplane dataset is shown in Figure 3:

Figure 3: Linear interpolation in latent space to generate samples from Aeroplane dataset
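Linear interpolation in latent space, as used for Figures 2 and 3, means picking two latent vectors and feeding evenly spaced blends of them to the Generator. A minimal sketch of the blending step is given below. The helper names and the number of steps are hypothetical, and the Generator call itself is omitted.

```python
def lerp(z0, z1, t):
    """Blend two latent vectors: t=0 gives z0, t=1 gives z1."""
    return [(1 - t) * a + t * b for a, b in zip(z0, z1)]


def interpolate(z0, z1, steps):
    """Return `steps` latent vectors evenly spaced from z0 to z1."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z0, z1, i / (steps - 1)) for i in range(steps)]


# Toy two-dimensional latent vectors; real ones would be much longer.
path = interpolate([0.0, 2.0], [1.0, 0.0], steps=5)
for z in path:
    print(z)  # each z would be passed to the Generator to produce a clip
```

Smooth changes in the generated samples along such a path are usually taken as a sign that the Generator has learned a meaningful latent space rather than memorizing training clips.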

The frames generated using the UCF101 dataset are shown in Figure 4:

Figure 4: Results on UCF-101 generated from random noise. For each task, we display 8 frames of our generated videos for the JumpingJack (1st row) and TaiChi (2nd row).


In this paper, we presented a novel neural network using generative models for unsupervised video generation. Our network is an extension of the original GAN architecture and is trained using mini-batch Stochastic Gradient Descent. The novel loss term combines a mean square pixel loss with an adversarial loss that is subject to a k-Lipschitz constraint, as in Wasserstein GANs. We presented the architecture details, the optimization and the complete algorithm used in this work. When tested on the UCF101, Golf and Aeroplane datasets, with Inception Score and Fréchet Inception Distance as the evaluation metrics, our network outperforms previous state-of-the-art approaches. Finally, we also presented linear interpolation in latent space on the Golf and Aeroplane datasets and the frames generated using the UCF101 dataset.


  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875.
  • I. Deshpande, Z. Zhang, and A. G. Schwing (2018) Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3483–3491.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
  • A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with PixelCNN decoders. In Advances in neural information processing systems, pp. 4790–4798.
  • C. Wang, C. Xu, C. Wang, and D. Tao (2018) Perceptual adversarial networks for image-to-image transformation. IEEE Transactions on Image Processing 27 (8), pp. 4066–4079.

Before You Go

Paper: https://arxiv.org/pdf/2008.09646.pdf

Code: https://github.com/abhinavsagar/hrvgan

