Image Captioning using Attention-based models

Dataset and evaluation metrics

Our proposed image captioning model is evaluated on the Flickr8k dataset, which contains 8,091 images, most of them depicting people performing various activities. Each image is paired with five human-written reference captions. We used several standard metrics, including BLEU, METEOR, ROUGE, CIDEr, and SPICE, to evaluate the proposed model and compare it with other baselines. All metrics were computed with their publicly released evaluation code.
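To make the evaluation concrete, here is a minimal sketch of how a BLEU-1 score is computed for one caption against its references: clipped unigram precision multiplied by a brevity penalty. This is an illustrative re-implementation, not the released evaluation code the results are based on.

```python
from collections import Counter
import math

def bleu1(references, hypothesis):
    """BLEU-1 for a single hypothesis: clipped unigram precision
    times the brevity penalty against the closest reference length.
    references: list of token lists; hypothesis: token list."""
    hyp_counts = Counter(hypothesis)
    # For each word, the maximum count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for tok, c in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], c)
    clipped = sum(min(c, max_ref[tok]) for tok, c in hyp_counts.items())
    precision = clipped / max(len(hypothesis), 1)
    # Brevity penalty uses the reference length closest to the hypothesis.
    r = min((abs(len(ref) - len(hypothesis)), len(ref)) for ref in references)[1]
    bp = 1.0 if len(hypothesis) >= r else math.exp(1 - r / len(hypothesis))
    return bp * precision
```

The full BLEU-4 score used in the tables extends this to clipped n-gram precisions up to n = 4, combined as a geometric mean.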

Baseline A

For the first baseline, we implemented the following steps. First, we added “startonee” and “endonee” at the beginning and end of each caption, respectively. Then we pre-processed the captions to improve the structure of the text data: converting the text to lower case, removing punctuation, and discarding words containing numbers. For the embedding layer, we used pre-trained GloVe 42B 300d word embeddings. A pretrained InceptionV3 network was used to extract 2048-dimensional feature vectors from each image. Each caption was tokenized and fed into the network along with the image features to predict the next word of the caption.
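The caption cleaning described above can be sketched as a small helper. This is an illustrative reconstruction of the steps listed (lowercasing, punctuation removal, dropping words with digits, adding the start/end markers), not the exact code used in the baseline.

```python
import string

def preprocess_caption(caption):
    """Clean a raw caption: lowercase, strip punctuation, drop words
    containing digits, then wrap with the "startonee"/"endonee" markers."""
    table = str.maketrans("", "", string.punctuation)
    words = [w.translate(table) for w in caption.lower().split()]
    words = [w for w in words if w and not any(ch.isdigit() for ch in w)]
    return "startonee " + " ".join(words) + " endonee"
```

The same pipeline would then map each cleaned caption to GloVe indices and pair it with the InceptionV3 feature vector of its image.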

Baseline B

For the second baseline, we extracted image features using the InceptionV3 model initialized with ImageNet weights. “start” and “end” tokens were appended at the beginning and end of each caption, respectively. Captions were then tokenized using the nltk library for further use, and padded to the length of the longest caption. The training and testing split ratio used for evaluation is 80:20. Bahdanau attention, inspired by the paper “Show, Attend and Tell”, was used in this baseline to focus on the salient objects in the images. The encoder consists of CNN and ReLU layers. Moreover, unlike Baseline A, where the decoder used an LSTM, Baseline B uses gated recurrent units (GRUs) in the decoder. The implementation of Baseline B was adapted from TensorFlow’s website.
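The Bahdanau (additive) attention step used in Baseline B can be sketched as follows. This is a minimal NumPy illustration of the mechanism, with assumed weight shapes; the actual baseline uses TensorFlow layers with learned parameters.

```python
import numpy as np

def bahdanau_attention(features, hidden, W1, W2, v):
    """Additive (Bahdanau) attention over CNN annotation vectors.
    features: (L, D) image feature vectors (L spatial locations)
    hidden:   (H,)   current GRU decoder state
    W1: (D, U), W2: (H, U), v: (U,) learned parameters (here: assumed shapes)."""
    # Score each location: v^T tanh(W1 @ feature + W2 @ hidden)
    scores = np.tanh(features @ W1 + hidden @ W2) @ v      # (L,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax over locations
    context = weights @ features                           # (D,) weighted sum
    return context, weights
```

At each decoding step, the context vector is concatenated with the previous word embedding and fed to the GRU; the attention weights are what the visualizations later in this post display over the image.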

Results for Baseline A and B

Final Model A and B

For the final evaluation, we proposed two attention-based models inspired by “Show, Attend and Tell”. Model A uses the Madgrad optimizer with a ResNet152 encoder, while Model B uses the Madgrad optimizer with a VGG16 encoder.

Architecture for the final models:

Results obtained by using models A and B can be seen in the table given below.

Following are some results for the final Model B and Baseline B:

We can see some examples of attention visualizations for the final Model B below:


  1. Attention Is All You Need
  2. Show, Edit and Tell
  3. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

