As illustrated in the figure above, the first part of GPEN is a CNN encoder, which learns to map the degraded input image x to a desired latent code z in the latent space Z of the GAN. The GAN prior network then reproduces the desired HQ face image via G(z) → y, where G is the learned generator of the GAN. This formulation turns the one-to-many restoration problem into a one-to-one mapping.
The GAN prior network
Inspired by StyleGAN, the prior network includes a mapping network that projects the latent code z into a less entangled space w ∈ W; this intermediate code w is then broadcast to each GAN block. Furthermore, unlike in StyleGAN, the noise inputs are concatenated to the convolutional features rather than added to them.
Since the GAN prior network will be embedded into a U-shaped DNN for fine-tuning, additional noise inputs are provided to each GAN block as proxies for skip connections.
Full network architecture
Once the GAN prior network is pre-trained, the latent code z and the noise inputs to the GAN network are replaced by, respectively, the output of the fully connected layer and the outputs of the shallower layers of the DNN's encoder. Since the CNN encoder ends in a fully connected (FC) layer, all input images must share the same resolution; LQ face images are therefore first resized to the desired resolution with simple bilinear interpolation before being fed to GPEN.
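The data flow described above can be sketched in plain Python. All module names here (encoder_blocks, fc, mapping, gan_blocks) are hypothetical placeholders for illustration, not the authors' actual implementation:

```python
# Minimal sketch of the GPEN forward pass described above.
# encoder_blocks: CNN stages of the U-shaped encoder, shallowest first;
# fc: the final fully connected layer producing the latent code;
# mapping: the StyleGAN-style mapping network (z -> w);
# gan_blocks: the embedded, pre-trained GAN prior blocks.

def gpen_forward(lq_image, encoder_blocks, fc, mapping, gan_blocks):
    # Run the encoder, keeping every intermediate feature map: the
    # shallower-layer outputs replace the GAN prior's noise inputs.
    feats = []
    x = lq_image
    for block in encoder_blocks:
        x = block(x)
        feats.append(x)
    z = fc(x)       # deepest encoder output -> latent code z
    w = mapping(z)  # intermediate code w, broadcast to every GAN block
    # Decode: each GAN block consumes w plus the matching encoder feature,
    # concatenated (not added) in place of StyleGAN's noise input.
    y = None
    for block, skip in zip(gan_blocks, reversed(feats)):
        y = block(y, w, skip)
    return y
```

With toy callables standing in for the real modules, the function simply threads the encoder features into the decoder in reverse order, mirroring a U-Net's skip connections.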
To fine-tune the GPEN model, three loss functions are adopted: the adversarial loss, the content loss, and the feature matching loss.
Adversarial loss is inherited from the GAN prior network.
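The formula itself is omitted in this excerpt; as a hedged reconstruction, a standard GAN adversarial objective consistent with the notation defined below is:

```latex
\mathcal{L}_A \;=\; \min_G \max_D \; \mathbb{E}_{X}\big[\log D(X)\big] \;+\; \mathbb{E}_{\tilde{X}}\big[\log\big(1 - D(G(\tilde{X}))\big)\big]
```

The exact variant (saturating vs. non-saturating) follows whatever objective the GAN prior network was originally trained with.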
where X and X˜ denote the ground-truth HQ image and the degraded LQ one, G is the generator during training, and D is the discriminator.
Content Loss is the L1-norm distance between the final results of the generator and the corresponding ground-truth image.
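In the notation above, this L1 distance can be written as:

```latex
\mathcal{L}_C \;=\; \big\lVert G(\tilde{X}) - X \big\rVert_1
```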
Feature Matching Loss is similar to the perceptual loss but it is based on the discriminator rather than the pre-trained VGG network to fit the task.
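One common form of this loss (the exact norm and layer weighting are assumptions here, not quoted from the excerpt) sums distances between intermediate discriminator features, where D_i denotes the i-th feature map of D and T the number of layers used:

```latex
\mathcal{L}_F \;=\; \sum_{i=1}^{T} \big\lVert D_i(X) - D_i\big(G(\tilde{X})\big) \big\rVert_2
```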
The final loss L combines the three terms, with weight α on the content loss and β on the feature matching loss:

L = L_A + α·L_C + β·L_F
The content loss enforces fidelity of fine features and preserves the original colour information. Introducing the feature matching loss on the discriminator allows the adversarial loss to be better balanced, recovering more realistic face images with vivid details. In all the following experiments, α = 1 and β = 0.02 are set empirically.
The GAN prior network is trained on the FFHQ dataset (70,000 HQ face images at 1024 × 1024 resolution) with settings similar to StyleGAN. The pre-trained GAN prior network is then embedded into GPEN for fine-tuning.
To build LQ-HQ image pairs for fine-tuning, we synthesise degraded faces from the HQ images in FFHQ using the following degradation model:

Ĩ = JPEGq((I ⊗ k) ↓s + nσ)
where I, k, and nσ are respectively the input face image, the blur kernel, and Gaussian noise with standard deviation σ; ⊗, ↓s, and JPEGq respectively denote two-dimensional convolution, the standard s-fold downsampler, and the JPEG compression operator with quality factor q.
To summarise, the image is first blurred, then downsampled to the desired resolution; Gaussian noise is added, and the result is JPEG-compressed.
For each image the blur kernel k is randomly selected from a set of blurring models, including Gaussian blur and motion blur with varying kernel sizes. The additive Gaussian noise nσ is sampled channel-wise from a normal distribution, and σ is chosen from [0, 25]. The value of s is randomly and uniformly sampled from [10, 200] (i.e., up to 200 times downscaling) and q is randomly and uniformly sampled from [5, 50] (i.e., up to 95% JPEG compression) per image.
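The degradation pipeline above can be sketched with NumPy. This is a simplified illustration, not the authors' code: the Gaussian kernel is one example of the randomly selected blur models, the loop-based convolution is written for clarity rather than speed, and the JPEG step is left as a pluggable callable because JPEG encoding requires an image library:

```python
import numpy as np

def gaussian_kernel(size=9, sigma=2.0):
    # Isotropic 2-D Gaussian blur kernel, normalised to sum to 1.
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def degrade(img, kernel, s=4, sigma_n=10.0, jpeg=None, rng=None):
    """Synthesise an LQ image: JPEG_q((img ⊗ k) ↓s + n_sigma).

    `jpeg` is a hypothetical callable standing in for JPEG_q compression.
    In training, kernel, s, sigma_n and q would be sampled per image
    from the ranges described above.
    """
    rng = rng or np.random.default_rng(0)
    # 2-D convolution (blur) with edge padding, 'same' output size.
    pad = kernel.shape[0] // 2
    padded = np.pad(img, pad, mode="edge")
    blurred = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            window = padded[i:i + kernel.shape[0], j:j + kernel.shape[1]]
            blurred[i, j] = (window * kernel).sum()
    down = blurred[::s, ::s]                            # s-fold downsampler
    noisy = down + rng.normal(0.0, sigma_n, down.shape) # additive Gaussian noise
    out = np.clip(noisy, 0, 255)
    return jpeg(out) if jpeg is not None else out
```

For a real pipeline the convolution and resampling would come from an image library, and `jpeg` would round-trip the array through an actual JPEG encoder at a random quality factor q.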
The assumption behind using such a wide range of degradation hyperparameters when generating the LQ images for fine-tuning is that the resulting set of synthetic degradations covers the degradations found in real-world face images.
To better understand the roles of different components of GPEN and the training strategy, an ablation study is conducted by introducing some variants of GPEN and comparing their BFR performance.
The variants are:
- GPEN-w/o-ft, in which the embedded GAN prior network is kept unchanged during fine-tuning;
- GPEN-w/o-noise, the GPEN model without noise inputs;
- GPEN-noise-add, in which the noise inputs are added rather than concatenated to the convolutions.