Evaluating Mode Collapse in GANs Using NDB Score

Below are a few art pieces I generated from a GAN. They aren’t striking at all, but they’re diverse. However, this is not always the case.

GAN generated images

The next set of images is from another GAN I trained. Not only are they awful, but they’re also identical.

GAN generated images

GANs are notoriously hard to train. They seldom, if ever, converge and often suffer from mode collapse. As illustrated in the above images, mode collapse happens when a GAN fails to pick up the different modes present in the data distribution and keeps generating near-identical pictures.

It’s convenient to spot mode collapse by merely plotting images, but as dataset size increases, it might be handy to evaluate it quantitatively. We’ll do that using the NDB score.

This post assumes familiarity with the GAN training mechanism. Refer to this post if you don’t know how they function.

Mode Collapse

You see, mode collapse is ingrained in the GAN training strategy. Real-world data is multi-modal, and an ideal GAN must capture all of its modes. For instance, each digit in the MNIST dataset is a separate mode, and you’d prefer a GAN that generates all the digits. However, the training objective never explicitly incentivizes the generator to do so.

Suppose the generator constructs the digit ‘2’ well enough to fool the discriminator. It doesn’t need to hustle anymore. The discriminator, though, during its training iteration, will receive these generated twos labeled as fake and, over time, learn to catch the bluff. When this happens, the generator could easily switch to another digit, say ‘3’, and continue the mode collapse loop. Intuitively, you could consider this as apathy to work extra when less is sufficient.

Now, let’s learn to track this phenomenon quantitatively.

Setting up the GAN

The full notebook for this implementation can be found at these links:


The data used for training can be found here (license). These are a few images from it.

We’ll start by making these imports.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Reshape, Conv2D, BatchNormalization, Conv2DTranspose
from tensorflow.keras.layers import LeakyReLU, Dropout, ZeroPadding2D, Flatten, Activation
from tensorflow.keras.optimizers import Adam
from sklearn.cluster import KMeans

Next, we’ll load the images from the directory using the TensorFlow data loader, resize them to (64,64), and scale the pixel values to [0, 1]. Note the batch size here is half of the global batch because the other half will come from generated images.

BATCH = 64
IMG_SIZE = (64,64)
EPOCHS = 600
PATH = "../input/abstract-art-gallery/Abstract_gallery/Abstract_gallery"
#Importing data

batch_s = int(BATCH/2)
#Import as tf.Dataset
data = tf.keras.preprocessing.image_dataset_from_directory(PATH, label_mode = None, image_size = IMG_SIZE, batch_size = batch_s).map(lambda x: x /255.0)

Let us now build the generator and the discriminator. Note that the discriminator does not include any pooling layers. According to this 2015 paper, strided convolutions perform better than pooling layers.

# Generator (excerpt): four stride-2 transposed convolutions
generator.add(Conv2DTranspose(256, kernel_size=4, strides=2, padding="same"))
generator.add(Conv2DTranspose(128, kernel_size=4, strides=2, padding="same"))
generator.add(Conv2DTranspose(64, kernel_size=4, strides=2, padding="same"))
generator.add(Conv2DTranspose(3, kernel_size=4, strides=2, padding="same",

# Discriminator (excerpt): four stride-2 convolutions, no pooling layers
discriminator.add(Conv2D(32, kernel_size=4, strides=2, padding="same",input_shape=[64,64, 3]))
discriminator.add(Conv2D(64, kernel_size=4, strides=2, padding="same"))
discriminator.add(Conv2D(128, kernel_size=4, strides=2, padding="same"))
discriminator.add(Conv2D(256, kernel_size=4, strides=2, padding="same"))
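
As a quick sanity check on the strides in the excerpt, here is a small sketch of the spatial sizes each layer produces. It relies on Keras’s `padding="same"` behaviour (a stride-2 transposed convolution doubles height and width, a stride-2 convolution halves them, rounding up). It also assumes the generator’s first transposed convolution receives a 4×4 feature map, which the excerpt does not show. The helper names are hypothetical.

```python
import math


def conv_same_out(size, stride):
    """Spatial output size of a strided convolution with padding='same'."""
    return math.ceil(size / stride)


def conv_transpose_same_out(size, stride):
    """Spatial output size of a transposed convolution with padding='same'."""
    return size * stride


def trace(size, n_layers, step, stride=2):
    """Apply `step` n_layers times and record the size after each layer."""
    sizes = [size]
    for _ in range(n_layers):
        size = step(size, stride)
        sizes.append(size)
    return sizes


# Assumption: the generator starts from a 4x4 feature map (not shown in the excerpt).
generator_sizes = trace(4, 4, conv_transpose_same_out)
# The discriminator input is 64x64, as given by input_shape=[64, 64, 3].
discriminator_sizes = trace(64, 4, conv_same_out)

print(generator_sizes)      # [4, 8, 16, 32, 64]
print(discriminator_sizes)  # [64, 32, 16, 8, 4]
```

Four stride-2 layers are exactly what is needed to move between 4×4 and 64×64, which is why neither network needs pooling or explicit upsampling layers.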

Define the training process

class GAN(tf.keras.Model):
    def __init__(self, discriminator, generator, latent_dim):
        super(GAN, self).__init__()
        self.discriminator = discriminator
        self.generator = generator
        self.latent_dim = latent_dim

    def compile(self, d_optimizer, g_optimizer, loss_fn):
        super(GAN, self).compile()
        self.d_optimizer = d_optimizer
        self.g_optimizer = g_optimizer
        self.loss_fn = loss_fn
        self.dloss = tf.keras.metrics.Mean(name="discriminator_loss")
        self.gloss = tf.keras.metrics.Mean(name="generator_loss")

    @property
    def metrics(self):
        return [self.dloss, self.gloss]

    def train_step(self, real_images):
        batch_size = tf.shape(real_images)[0]

        # Discriminator step: generated images are labeled 1, real images 0
        noise = tf.random.normal(shape=(batch_size, self.latent_dim))
        generated_images = self.generator(noise)
        combined_images = tf.concat([generated_images, real_images], axis=0)
        labels = tf.concat([tf.ones((batch_size, 1)), tf.zeros((batch_size, 1))], axis=0)
        labels += 0.05 * tf.random.uniform(tf.shape(labels))  # add noise to the labels
        with tf.GradientTape() as tape:
            predictions = self.discriminator(combined_images)
            dloss = self.loss_fn(labels, predictions)
        grads = tape.gradient(dloss, self.discriminator.trainable_weights)
        self.d_optimizer.apply_gradients(zip(grads, self.discriminator.trainable_weights))

        # Generator step: label fresh fakes as "real" (0) and update only the generator
        noise = tf.random.normal(shape=(2*batch_size, self.latent_dim))
        labels = tf.zeros((2*batch_size, 1))
        with tf.GradientTape() as tape:
            predictions = self.discriminator(self.generator(noise))
            gloss = self.loss_fn(labels, predictions)
        grads = tape.gradient(gloss, self.generator.trainable_weights)
        self.g_optimizer.apply_gradients(zip(grads, self.generator.trainable_weights))

        # Update the running means so the reported losses reflect this step
        self.dloss.update_state(dloss)
        self.gloss.update_state(gloss)
        return {"d_loss": self.dloss.result(), "g_loss": self.gloss.result()}

Now, let’s move to the evaluation. We’ll use k-means clustering, which might not make sense qualitatively since all our images are random paintings (a single class). Nevertheless, I expect the k-means algorithm to identify subtle similarities among them and create appropriate clusters.

We have RGB images of shape (64,64). To reduce dimensions, we’ll average the arrays along the last axis to convert them to grayscale. Note that the standard grayscale conversion is a weighted sum of the channels rather than a plain average. Refer to this link for more information. We can further shrink the dimensions using autoencoders/PCA, but I’ll refrain for now. Finally, for clustering, we’ll also flatten the images.

images = np.asarray(images)
images = np.mean(images,axis=3)
images = images.reshape((images.shape[0],-1))
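
For comparison, a luminance-weighted conversion can be sketched as follows. The weights are the ITU-R BT.601 luma coefficients (0.299, 0.587, 0.114), which is one common choice and not necessarily the formula the linked reference describes. `to_grayscale` and the random `batch` are hypothetical stand-ins for the real image array.

```python
import numpy as np

# Assumption: BT.601 luma weights for the R, G, B channels (they sum to 1).
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])


def to_grayscale(rgb_batch, weighted=True):
    """Convert a (N, H, W, 3) batch to (N, H, W) grayscale."""
    rgb_batch = np.asarray(rgb_batch, dtype=np.float64)
    if weighted:
        # Weighted sum over the channel axis
        return rgb_batch @ LUMA_WEIGHTS
    # Plain channel average, as used in the post
    return rgb_batch.mean(axis=3)


rng = np.random.default_rng(0)
batch = rng.random((5, 64, 64, 3))        # stand-in for real images in [0, 1]
gray = to_grayscale(batch)
flat = gray.reshape((gray.shape[0], -1))  # one row per image, ready for k-means
print(flat.shape)                         # (5, 4096)
```

Either variant yields 4096 features per image, so the choice only changes how the colour channels are weighted, not the cost of the clustering step.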

To limit computation effort, I’ve only used the first 500 images to create clusters. The elbow curve is not quite plateauing yet. It could be because of the subtle differences between images of the same class. Further reduction in image dimensionality might help create better clusters. For illustration, we’ll work with the kink at 7 clusters.

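inertias = []
for c in range(4,10):
    kmeans = KMeans(c).fit(images[:500])   # first 500 images only
    inertias.append(kmeans.inertia_)       # within-cluster sum of squares
plt.plot(range(4,10), inertias)
plt.xlabel('Number of Clusters')
plt.title('Elbow Score')
kmeans = KMeans(7).fit(images[:500])       # refit at the chosen number of clusters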
Image by author

Now, we’ll generate 500 images from the generator and see which clusters they fall into.

arr = tf.random.normal(shape=(500,LATENT_DIM))
generated_portraits = generator(arr)
generated_portraits = np.array(generated_portraits).mean(axis=3).reshape((generated_portraits.shape[0],-1))
generated_classes = kmeans.predict(generated_portraits)
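
A simple way to see which clusters the generated images land in is to count the assignments per cluster. The sketch below uses made-up labels in place of `generated_classes` and assumes 7 clusters. `cluster_counts` is an illustrative helper, not part of the original notebook.

```python
from collections import Counter


def cluster_counts(labels, num_clusters):
    """Return how many samples fall into each cluster, including empty ones."""
    counts = Counter(int(label) for label in labels)
    return [counts.get(c, 0) for c in range(num_clusters)]


# Hypothetical cluster assignments standing in for kmeans.predict(...) output
example_labels = [0, 1, 1, 2, 3, 3, 3, 5, 6, 6]
counts = cluster_counts(example_labels, num_clusters=7)
empty = [c for c, n in enumerate(counts) if n == 0]

print(counts)  # [1, 2, 1, 3, 0, 1, 2]
print(empty)   # [4]
```

An empty cluster is the first hint that the generator is missing a mode, though a proper comparison needs the training proportions as well.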

We have generated images from all but cluster 4. The GAN seems to have learned the distribution well and could improve with more training iterations/hyper-parameter tuning. Next, we expand this evaluation to create a more concrete statistical test (NDB score).

Image by author

NDB Score

The ideal GAN must closely mimic the real data distribution. This is quantified using the NDB score. Here’s how it’s computed:

  1. Cluster the training data (t samples) into ‘n’ bins (like we have clustered the paintings into 7 bins)
  2. Generate g samples of images
  3. Predict the cluster (bin) of each generated image
  4. For each bin, do the following test:

     a. Compute the proportions of training and generated samples in the bin

     b. Divide their difference by the standard error SE, which is calculated as shown below.

Image by author

Here ‘p’ and ‘q’ are used to refer to training and generated data, and ‘P’ is the pooled sample proportion.

     c. If the p-value corresponding to the z-score is less than a threshold, the bin is deemed statistically different

  5. Divide the number of statistically different bins by the total number of bins. This yields a number between 0 and 1, quantifying the difference between the real and learned distributions.

  6. If the above quantity is greater than a set threshold, the GAN is deemed to have encountered mode collapse.
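
Written out, the per-bin test in step 4 is a two-proportion z-test. The notation below is reconstructed from the description above and from the implementation that follows, so treat the symbols N_p and N_q (the training and generated sample counts, t and g in the steps) as assumptions about the original figure:

```latex
P = \frac{N_p\,p + N_q\,q}{N_p + N_q}, \qquad
SE = \sqrt{P\,(1 - P)\left(\frac{1}{N_p} + \frac{1}{N_q}\right)}, \qquad
z = \frac{p - q}{SE}
```

Here p and q are the proportions of training and generated samples that fall in the bin, so N_p p and N_q q are simply the raw counts in that bin.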

def ndb_score(training_data_classes, generated_data_classes, num_classes, z_threshold):
    ndb = []  # bins whose proportions are statistically different
    NT = len(training_data_classes)
    NG = len(generated_data_classes)
    for i in range(num_classes):
        nt = np.sum(training_data_classes == i)
        pt = nt / NT  # training data proportion for bin
        ng = np.sum(generated_data_classes == i)
        pg = ng / NG  # generated data proportion for bin
        P = (nt + ng) / (NT + NG)  # pooled proportion
        SE = (P * (1 - P) * ((1 / NT) + (1 / NG))) ** 0.5
        if abs((pt - pg) / SE) > z_threshold:
            ndb.append(i)
    print(f"Statistically different classes: {ndb}")
    print(f"ndb score: {len(ndb)/num_classes}")
    return len(ndb) / num_classes
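
To connect the z-threshold used in the function with the p-value mentioned in step 4c, here is a worked example for a single bin. The counts are made up for illustration, and `two_proportion_z` and `two_sided_p_value` are hypothetical helpers. The p-value comes from the standard normal tail, computed with `math.erfc`.

```python
import math


def two_proportion_z(n_train_bin, n_train, n_gen_bin, n_gen):
    """z-score for the difference between two bin proportions."""
    p = n_train_bin / n_train  # training proportion in the bin
    q = n_gen_bin / n_gen      # generated proportion in the bin
    pooled = (n_train_bin + n_gen_bin) / (n_train + n_gen)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_train + 1 / n_gen))
    return (p - q) / se


def two_sided_p_value(z):
    """Two-sided p-value of a z-score under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up counts: 100 of 500 training images vs 50 of 500 generated images in one bin
z = two_proportion_z(100, 500, 50, 500)
p_value = two_sided_p_value(z)
print(f"z = {z:.3f}, p = {p_value:.2e}")

# A z-threshold of 1.96 corresponds to a two-sided p-value threshold of about 0.05
print(round(two_sided_p_value(1.96), 3))  # 0.05
```

Thresholding on |z| and thresholding on the p-value are therefore equivalent, as long as the two thresholds are chosen to match. Note that this sketch, like the function above, does not guard against a bin that is empty in both sets, where the standard error is zero.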

Our GAN has an NDB score of 0.25, and only two clusters (4 and 5) appear statistically different. So, we’ve successfully avoided the devious trench of mode collapse. This function can be made part of the GAN class and run at the end of each epoch as a validation scheme. You’ll find that code here.


Thanks for reading till the end. There are a lot of ideas around how to avoid mode collapse. Besides tuning hyper-parameters and trying different loss functions, one could adopt a different training strategy. This paper details a few mechanisms. I’ll try to cover them in some other post.

