Image Captioning Pytorch : A Machine Learning Model for Describing Images



Image Captioning Pytorch is a machine learning model that produces text describing what is visible in an input image. Image classification assigns predefined labels to the input image, whereas image captioning describes the image content in natural language.

Input image

Here is the output image caption.

a giraffe and a zebra standing in a field (FC model)

a group of zebras and a giraffe in a field (FC+RL+SelfCritical model)

a group of zebras and a giraffe standing on a dirt road (FC+RL+new SelfCritical model)

Image Captioning Pytorch has been implemented based on the following paper.


There are two approaches to image captioning: TopDown and BottomUp.

In the TopDown approach, captions are generated from feature vectors computed by an image classification backbone network such as ResNet50.

In the BottomUp approach, captions are generated from feature vectors computed by an object detection backbone network such as Faster R-CNN.

Example of the BottomUp approach

Image Captioning Pytorch uses the TopDown approach, which consists of an encoder that computes the feature vector and a decoder that outputs the caption. The encoder uses ResNet101 and outputs a 2048-dimensional feature vector, while the decoder uses an LSTM to produce the word sequence.

Reinforcement Learning (RL) has been proposed as a countermeasure to bias when training image captioning models. Self Critical Sequence Training (SCST) improves the stability of reinforcement learning and provides the best accuracy.


Image Captioning Pytorch uses an improved version of Self Critical called new Self Critical.

This “new self critical” is borrowed from “Variational Inference for Monte Carlo Objectives”. The only difference from the original self critical is the definition of the baseline.

In the original self critical, the baseline is the score of the greedy decoding output. In new self critical, the baseline for each sample is the average score of the other samples (this requires the model to generate multiple samples per image).
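The difference between the two baselines can be sketched in plain Python with NumPy. The function names and reward values are hypothetical; the sketch only illustrates the two baseline definitions described above (the advantage reward − baseline is what would scale the policy-gradient loss).

```python
import numpy as np

def scst_advantages(sample_rewards, greedy_reward):
    """Original self critical: baseline = reward of the greedy-decoded caption."""
    return np.asarray(sample_rewards, dtype=float) - greedy_reward

def new_scst_advantages(sample_rewards):
    """New self critical: each sample's baseline = mean reward of the OTHER samples."""
    r = np.asarray(sample_rewards, dtype=float)
    n = len(r)
    baseline = (r.sum() - r) / (n - 1)  # leave-one-out mean
    return r - baseline

# Hypothetical CIDEr-style rewards for 4 sampled captions of one image.
rewards = [0.8, 0.6, 1.0, 0.2]
print(scst_advantages(rewards, greedy_reward=0.7))
print(new_scst_advantages(rewards))  # advantages sum to zero by construction
```

Because the leave-one-out baseline is computed from the sampled captions themselves, no extra greedy decoding pass is needed, at the cost of drawing several samples per image.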

Training datasets

Image Captioning Pytorch has been trained on the MSCOCO and Flickr 30k datasets.

Image Captioning Pytorch accuracy

Accuracy measurements are presented in



Use the following command to run Image Captioning Pytorch and generate captions for images from the webcam video stream.

$ python3 -v 0

The FC, FC+RL+SelfCritical, and FC+RL+NewSelfCritical models can be selected by specifying fc, fc_rl, or fc_nsc, respectively, in the model option.

Image Captioning Pytorch is available with ailia SDK 1.2.5 or newer.


