8 Tips To Build Powerful Deep Learning Models for Visual Similarity


Three months of best practices, condensed, to make your Siamese networks behave well and produce high-quality embeddings


A while ago, I participated in a data science challenge that took place at my previous company. The goal was to help marine researchers better identify whales based on the appearance of their flukes.

More specifically, we were asked to predict, for each image of the test set, the top 20 most similar images from the full database (train + test).

This was not a standard classification task.

I spent 3 months prototyping and ended up third out of 300 participants on the final (private) leaderboard.

As a side note, Zeus is my home GPU-powered server. That’s right, it has a name.

Final (private) leaderboard

But let’s not get into the specifics of this challenge.

The purpose of this post is to share my tips for building strong embedding models for visual similarity tasks. This challenge was a superb learning opportunity where I tried a lot of different techniques, so I’ll detail the steps I went through and share what worked best and what did not.

Without further ado, let’s have a look 🔍

PS: the code of the following experiments is on my Github repo.

1 — Formalize the problem and choose the right loss

The underlying question I first asked myself is: How can I build a numerical representation of a whale’s fluke that can efficiently embed its characteristics and be used in the similarity tasks?

The goal is to build a model that generates a “good” representation of the input image: a whale’s fluke — Image by the author

First approach: classification

The naïve approach I went for at first was to train a convolutional neural network (CNN) to classify the images over their set of labels (the whale ids) using the standard softmax cross-entropy loss and then take the output of the last fully connected layer as embedding. Unfortunately, training the network to optimize the cross-entropy doesn’t produce good embedding vectors for similarity.
The reason it’s not very effective for this problem is that cross-entropy only learns to map an image to a label, without learning the relative distances (or similarities) between inputs.

When you need an embedding for visual similarity tasks, your network should explicitly learn, at training time, how to compare and rank items against one another. If you want to learn more, I recommend this post.
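For illustration, here is a minimal sketch of that naïve baseline in PyTorch: train a classifier over the whale ids with cross-entropy, then strip the classification head and use the penultimate activations as the embedding. The number of classes and the input size are placeholders, not the exact values from the challenge.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 1000  # hypothetical number of whale ids

# Classification baseline: ImageNet backbone + softmax head over the whale ids
model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)
criterion = nn.CrossEntropyLoss()
# ... train with softmax cross-entropy as usual ...

# At inference time, drop the classification head and keep the
# penultimate activations as the embedding
embedder = nn.Sequential(*list(model.children())[:-1])
with torch.no_grad():
    dummy_image = torch.randn(1, 3, 224, 224)
    embedding = embedder(dummy_image).flatten(1)  # shape: (1, 512) for resnet34
```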

From classification to metric learning

The task of learning embeddings that compare and rank inputs against one another is called metric learning.
This is a well-studied topic that has been applied in popular applications such as face identification and image retrieval.
I won’t cover what metric learning is in this post: there are excellent tutorials that explain it very well here and here.

I will just introduce two loss functions I experimented with during this challenge.

1. Triplet Loss

Triplet loss was introduced in Google’s FaceNet paper in 2015.
The authors explored a new technique for face embeddings by designing a system that learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity.
The proposed method optimizes the embedding itself rather than an intermediate loss that doesn’t explicitly solve the problem.

Source: https://omoindrot.github.io/triplet-loss

This loss is defined over a triplet of data:

  • An anchor image that represents a reference
  • A positive image of the same class as the anchor
  • A negative image of a different class

and optimizes the weights of a model in such a way that:

  • the Euclidean distance between the embedding of the anchor and the embedding of the positive image, i.e. d(a, p), is small
  • the Euclidean distance between the embedding of the anchor and the embedding of the negative image, i.e. d(a, n), is large

The triplet loss can be formalized as follows:

L = max(d(a, p) − d(a, n) + margin, 0)

This loss is by definition lower-bounded by 0. So optimizing the network will push it toward 0 as much as possible. When the training is complete:

  • d(a, p) becomes very small ~0
  • d(a, n) is greater than d(a, p) + margin
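As a concrete reference, here is a minimal sketch of this loss in PyTorch; it is essentially what the built-in torch.nn.TripletMarginLoss computes with Euclidean distance.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """L = max(d(a, p) - d(a, n) + margin, 0), averaged over the batch."""
    d_ap = F.pairwise_distance(anchor, positive, p=2)  # Euclidean d(a, p)
    d_an = F.pairwise_distance(anchor, negative, p=2)  # Euclidean d(a, n)
    return F.relu(d_ap - d_an + margin).mean()

# Built-in equivalent: torch.nn.TripletMarginLoss(margin=0.2, p=2)
```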

There are a few training tricks that I used to improve triplet loss training:

  • Hard sampling: I used hard triplets only to optimize the loss.
    A hard triplet (a, p, n) satisfies this inequality: d(a, n) < d(a, p)
  • PK Sampling: I used a sampler in my PyTorch dataloader to make sure each batch has size P×K, composed of P different classes with K images each.
  • Online generation of triplets: triplets are built on the fly inside each batch rather than precomputed offline.

You can find the implementation details of these tricks on my Github repo and if you want to learn more about these techniques, I recommend reading this paper.
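To make the hard-sampling idea concrete, here is a simplified sketch (not the exact code from the repo) of batch-hard mining inside a PK batch; it assumes `embeddings` of shape (P*K, dim) and integer `labels` of shape (P*K,).

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Batch-hard mining inside a PK batch: for each anchor, pick the
    hardest positive (farthest same-class sample) and the hardest
    negative (closest different-class sample)."""
    # Pairwise Euclidean distance matrix, shape (B, B)
    dist = torch.cdist(embeddings, embeddings, p=2)

    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)          # (B, B)
    pos_mask = same_class & ~torch.eye(len(labels), dtype=torch.bool,
                                       device=labels.device)
    neg_mask = ~same_class

    # Hardest positive: maximum distance among same-class pairs
    hardest_pos = (dist * pos_mask.float()).max(dim=1).values
    # Hardest negative: minimum distance among different-class pairs
    # (same-class entries are pushed to +inf so they are never selected)
    hardest_neg = dist.masked_fill(~neg_mask, float("inf")).min(dim=1).values

    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```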

2. ArcFace

I came across this loss three weeks before the end of the challenge and was blown away by its effectiveness the moment I tried it.

The ArcFace loss was introduced at CVPR 2019 and its main goal is to maximize face class separability by learning highly discriminative features for face recognition. According to the authors of the paper, this method outperformed triplet loss, intra-loss, and inter-loss on the most common face identification benchmarks.

Source: ArcFace paper

Given a feature vector extracted from the network and the corresponding ground-truth label (in this case, the whale id), ArcFace learns a weight matrix whose rows act as class centers. After normalizing both the features and the weights, their dot product is the cosine of the angle between the feature and each class center, which gives the loss a geometric interpretation.
ArcFace then adds a margin to the angle of the ground-truth class, converts back to (scaled) cosine logits, and applies the usual softmax cross-entropy. The main benefit is that this additive angular margin explicitly maximizes class separability; apart from this modification, ArcFace is just a softmax cross-entropy loss, so the training overhead is small.

Source: ArcFace paper
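The following is a simplified sketch of an ArcFace head in PyTorch. The scale s and margin m values are the defaults reported in the paper, and real implementations (including the one in my repo) add a few numerical-stability tricks that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Simplified ArcFace: cosine logits with an additive angular margin."""
    def __init__(self, embedding_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, features, labels):
        # Cosine of the angle between each embedding and each class center
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the ground-truth class
        one_hot = F.one_hot(labels, num_classes=self.weight.size(0)).bool()
        logits = torch.where(one_hot, torch.cos(theta + self.m), cosine)
        # Scale and fall back to the usual softmax cross-entropy
        return F.cross_entropy(self.s * logits, labels)
```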

When I was experimenting with ArcFace, I noticed some benefits over the triplet loss:

  • ArcFace scales well for a large number of classes
  • It alleviates the hard-sample-mining problem encountered when training with triplet loss (since it doesn’t need any mining). All it needs is the data and the corresponding labels.
  • It provides a nice geometric interpretation
  • It provides stable training
  • It converges faster
  • and most importantly, a single model trained with this loss performed better than a blend of five models trained with triplet loss.

That’s why I used it in my final submission.

ArcFace was the cornerstone of my solution. Let’s now have a look at the different steps that helped me set up my training efficiently.

2 — Be one with the data 🗂

This goes without saying, but I’m gonna say it anyway: spend as much time as you can inspecting your data. Whether you’re working in Computer Vision or NLP, a deep learning model, like any other model for that matter, is garbage-in garbage-out. It doesn’t matter how many deep layers it has. If you feed it poor-quality data, you should not hope for good results.

There are a couple of things I did on the data of this challenge (this procedure obviously applies to any dataset in metric learning tasks):

  • I removed noisy and corrupt images where either the resolution was very low or the whale’s fluke wasn’t visible at all
  • I discarded classes that had only one image: this proved to be very effective. The reason is that metric learning needs a bit of context about each class, and one image per class is obviously not enough.
  • I extracted the bounding boxes of the whales’ flukes in order to discard surrounding noise (water splash, sea) and zoom in on the relevant information. This crop later acts as a form of attention mechanism.
    To do this, I trained a Yolo-V3 fluke detector from scratch after annotating about 300 whale flukes on makesense.ai, an image labeling tool built by Piotr Skalski.
    I also used this excellent repo to train the Yolo-V3 model.
Screenshot by the author

Key learning 👨‍🏫: you’ll likely win more points with properly cleaned data than with sophisticated modeling.

3 — Do not underestimate the power of transfer learning 🔄

In the first weeks of the competition, I used ImageNet pretrained models (resnet34, densenet121, etc.) as backbones. It was fine; my models ended up converging after some time.

Then I looked into the Kaggle Humpback Whale Identification competition data.

  • Apart from the leaderboard metric, this competition was very similar to our challenge
  • The data had the same structure as ours, with the same class imbalance problem
  • The flukes don’t look exactly the same as in our competition (they come from another species), but this was fine.
Kaggle whales’ fluke — source: Kaggle

I immediately decided to finetune the ImageNet pretrained models on this data using the triplet loss.

Funny how things worked out:

  • This had a huge impact! I jumped up in the leaderboard
  • The network was able to converge faster (in 30% fewer epochs)

Key learnings 👨‍🏫:

  • Transfer learning rarely hurts. ImageNet models pretrained on 1,000 common object classes (animals, cars, etc.) are a good starting point, but a network pretrained on a dataset similar to yours will most likely do even better (sketched below).
  • Transfer learning is an indirect way to bring more data to your training
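Here is a minimal sketch of that two-stage transfer; the checkpoint path and the backbone are placeholders, and the fine-tuning loops are elided.

```python
import torch
from torchvision import models

# Stage 1: start from ImageNet weights and fine-tune on the Kaggle
# humpback-whale data (e.g. with the triplet loss), then save the backbone
backbone = models.resnet34(pretrained=True)
# ... fine-tune on the Kaggle dataset ...
torch.save(backbone.state_dict(), "resnet34_kaggle_whales.pth")  # hypothetical path

# Stage 2: initialize the challenge model from those whale-specific weights
# instead of the plain ImageNet ones, then fine-tune on the challenge data
backbone = models.resnet34(pretrained=False)
backbone.load_state_dict(torch.load("resnet34_kaggle_whales.pth"))
# ... fine-tune on the challenge dataset ...
```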

4 — Input shapes strongly matter 📏📐 🔍

There is an important detail to mention about the data of this challenge: its high resolution. Because the photos were shot with professional equipment, some images reach 3000×1200 pixels or more.

When I started the competition, I set the input size of my network to 224×224 pixels, as I typically do in most image classification problems.

However, when I started varying the input sizes, I got a lift in performance. 480×480 was the best input shape that worked for me.

Key learnings 👨‍🏫:

  • If you’re dealing with high-resolution images, try increasing the input size of your network. The default 224×224 input shape used by most ImageNet models is not always the optimal choice. With larger input shapes, your network can pick up the small, fine-grained details that distinguish one whale from another.
  • Bigger is not always better, though. If you increase your input shape to 1000px or so, you’ll likely encounter these two issues:
  1. Slow training: larger inputs require far more computation and memory, and convergence is not guaranteed either, due to overfitting.
  2. Poor performance on small images: when tiny images are up-sampled to a 1000×1000px resolution, the original signal gets degraded.

5 — A sophisticated architecture is not necessarily the optimal choice 🤹

If you’re a bit familiar with the computer vision ecosystem, you’ve probably heard of popular architectures such as VGG or ResNet, and perhaps of more recent and sophisticated ones such as ResNet-Inception-V4 or NASNet.

Benchmark Analysis of Representative Deep Neural Network Architectures: paper

Here are the key learnings 👨‍🏫 I came to after three months of experimentation:

  • Large and deep state-of-the-art backbones are not always the optimal choice: if your dataset is small (like the one in this challenge), they quickly overfit, and if you have limited computational resources, you won’t be able to train them properly
  • A good approach is to start with a simple network and increase the complexity step by step while monitoring the performance on a validation set
  • If you plan to ship your solution in a web application, you also have to think about model size, memory consumption, inference time, etc.

6 — Design a robust pipeline ⚙

The training pipeline I put in place consists of 5 major steps:

Training pipeline — Source: my Github repo
  • Step 1: the dataloader connects to the database and serves the images and the corresponding labels to the network, in batches. It’s also responsible for shuffling the data between epochs and applying on-the-fly data augmentation.
    Heavy augmentation is applied as a form of regularization, for better generalization. Transformations include: Gaussian noise and blur, motion blur, random rain (to simulate splash effects), color shift, random changes in brightness, hue and saturation, sharpening, random changes of perspective, elastic transformations, random rotation ± 20°, affine transformations (translation and shearing), and random occlusion (to increase generalization capabilities)
  • Step 2: forward pass. The model takes the images as input and generates the features.
  • Step 3: the ArcFace loss is computed between the features and the targets
  • Step 4: back-propagation. The gradients of the loss w.r.t. model parameters are computed
  • Step 5: The Adam optimizer updates the weights using the gradients of the loss. This operation is performed on each batch.
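Putting the five steps together, a condensed sketch of one training epoch could look like the following; names such as `train_loader`, `backbone`, and `arcface_head` are placeholders, not the exact objects from my repo.

```python
import torch

def train_one_epoch(backbone, arcface_head, train_loader, optimizer, device):
    backbone.train()
    arcface_head.train()
    for images, labels in train_loader:            # Step 1: augmented batches
        images, labels = images.to(device), labels.to(device)

        features = backbone(images)                # Step 2: forward pass
        loss = arcface_head(features, labels)      # Step 3: ArcFace loss

        optimizer.zero_grad()
        loss.backward()                            # Step 4: back-propagation
        optimizer.step()                           # Step 5: Adam weight update
```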

7 — General training tips from top Kagglers 👨‍🏫

I ran a lot of experiments during this competition. Here is my list of tips to make training safe, reproducible, and robust.

  • Fix the seeds to ensure reproducibility. You will most likely have to add a few lines like the sketch below at the beginning of your script

More details here.
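A typical seeding block, covering Python, NumPy, and PyTorch (the exact lines in the repo may differ):

```python
import os
import random

import numpy as np
import torch

def fix_seeds(seed=2019):
    """Make runs reproducible across Python, NumPy, and PyTorch."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```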

  • Adam is a safe optimizer. However, don’t forget to set the weight decay to a non-zero value: it acts as a regularizer and prevents loss fluctuations. Value used: 1e-3
  • Heavy augmentation really improves the results. I started with simple rotations and translations, but when I added the transformations mentioned above, I got better results. Augmentations alleviate the problem of the lack of data and improve the model stability and generalization. To build an efficient augmentation pipeline, I highly recommend the albumentations library.
  • Use a learning rate scheduler to decrease the learning rate throughout the training. This helps prevent the loss from getting stuck in local minima.
    The one I ended up choosing is a warmup scheduler followed by cosine annealing (see the sketch after this list).
    It starts from a small learning rate and ramps up to the target (base) learning rate over a few epochs (the warmup phase), then decreases it with cosine annealing down to a final learning rate.
    The warmup phase acts as a regularizer and prevents early overfitting.
Source: researchgate.net
  • Monitor the loss values and other metrics at the end of each epoch. I used TensorBoard to plot them.
  • Pseudo-labeling can give you an edge: this technique is commonly used in Kaggle competitions. It consists of training a model on your train data, using it on the test data to predict the classes, taking the most confident predictions (> 0.9 probability), adding them to the original train data, and retraining again.
  • Make sure you have the right hardware. I had access to a GPU server with 11 GB of GPU memory and 64 GB of RAM. In terms of software, I was using a conda virtual environment with PyTorch 1.1.0 and torchvision 0.3.0.
    Training a Densenet121 backbone with the ArcFace loss on 480px-resolution images took approximately 1 minute per epoch. Convergence took around 90 epochs.
  • Keep track of your experiments by logging your models’ metrics and saving checkpoints during and at the end of training. You’ll find how this is done in my Github repo.
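To illustrate the scheduler tip above, here is a minimal warmup-then-cosine schedule built with PyTorch’s built-in scheduler classes. Note that these classes come from more recent PyTorch versions than the 1.1.0 setup mentioned above, and the epoch counts and learning rates are hypothetical.

```python
import torch

model = torch.nn.Linear(512, 128)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)

warmup_epochs, total_epochs = 5, 90
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        # Warmup: ramp the learning rate from 10% of the target up to the target
        torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=0.1, total_iters=warmup_epochs),
        # Then decay it with cosine annealing down to a small end learning rate
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=total_epochs - warmup_epochs, eta_min=1e-6),
    ],
    milestones=[warmup_epochs],
)

for epoch in range(total_epochs):
    # ... train one epoch ...
    scheduler.step()
```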

8 — Divide to conquer: combine multiple models for a final submission ⚡

I trained two models using the previous pipeline, with different configurations: backbone architecture, input size, and regularization scheme (the exact parameters are detailed in my Github repo).

What gave me an edge in the final score was the way I combined them. This is a simple meta-embedding technique that is quite commonly used in Natural Language Processing.

It consists of generating the embeddings of each model on all the samples and then concatenating them.

A meta-embedding model — Image by the author

This method is used to generate the meta-embeddings of the train and test datasets. The same similarity computations are then run on these meta-embeddings to generate the submission.
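Here is a sketch of the concatenation and of the top-20 retrieval it feeds. The model names, dimensions, and the per-model L2 normalization are illustrative assumptions, not necessarily what the repo does.

```python
import numpy as np

# Embeddings produced by the two base models on the same, aligned samples
emb_model_a = np.random.rand(5000, 512).astype(np.float32)   # e.g. a resnet34
emb_model_b = np.random.rand(5000, 1024).astype(np.float32)  # e.g. a densenet121

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# L2-normalize each embedding space, then concatenate them
meta = np.concatenate([l2_normalize(emb_model_a),
                       l2_normalize(emb_model_b)], axis=1)   # shape: (5000, 1536)

# Retrieval: rank the database by cosine similarity to each query
# and keep the 20 most similar images (index 0 is the query itself)
similarities = l2_normalize(meta) @ l2_normalize(meta).T
top20 = np.argsort(-similarities, axis=1)[:, 1:21]
```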

Key learning 👨‍🏫:

  • Meta-embedding concatenation provides an interesting embedding when the base models differ in backbone architecture (resnet34 vs. densenet121), image input size (480 vs. 620), or regularization scheme (dropout vs. no dropout)
  • Each individual base model “sees” a different thing: combining their embeddings produces a new hybrid one with increased representational power.

Final words 🙏

I would like to thank the whole GDSC team for their work in making this challenge a great learning opportunity and Lisa Steiner for giving us the chance to apply our knowledge to a new field.

I hope you’ll find here resources that you can use in other computer vision and deep learning projects.
