DPT: Segmentation Model Using Vision Transformer



DPT (Dense Prediction Transformers) is a segmentation model released by Intel in March 2021 that applies vision transformers to dense prediction on images. It achieves 49.02% mIoU on ADE20K for semantic segmentation, and it can also be used for monocular depth estimation, where it improves relative performance by up to 28% compared with a state-of-the-art fully convolutional network.
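For readers unfamiliar with the metric, mIoU (mean intersection over union) averages, over classes, the overlap between predicted and ground-truth pixels divided by their union. The following is a minimal sketch of that idea on a single toy label map. The function name `mean_iou` and the flat-list input format are assumptions made for illustration, and benchmark implementations typically accumulate intersections and unions over the whole dataset rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in either label map.

    pred and target are flat sequences of integer class ids, one per pixel.
    (Hypothetical helper for illustration, not part of DPT or ailia SDK.)
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where at least one map says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x2 label maps, flattened row by row.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
print(f"mIoU = {score:.4f}")
```

In this toy case class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, so the mean is 7/12.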


In DPT, vision transformers (ViT) are used as the backbone instead of a convolutional network. Using transformers allows the model to make more detailed and globally consistent predictions than a convolutional network. The improvement is especially pronounced when a large amount of training data is available.

Source: https://arxiv.org/pdf/2103.13413

The encoder divides the image into tiles, which are then tokenized (Embed in the figure above) and processed by the transformer layers. The Embed step is either a patch-based method that divides the image into tiles, or, in the hybrid variant, a method that tokenizes the pixel feature map obtained by applying ResNet50 to the input image.
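The patch-based form of the Embed step can be illustrated with a small sketch that cuts an image into non-overlapping square tiles and flattens each tile into one token. The function name `split_into_patches` and the nested-list image format are assumptions for illustration. The sketch also leaves out the learned linear projection and position embedding that a ViT applies to each flattened patch, and it uses a tiny patch size instead of the 16x16 patches used in the paper.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened square patches.

    Returns one token (a flat list of pixel values) per patch, in
    row-major patch order. Hypothetical helper for illustration only.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch_size x patch_size tile row by row.
            patch = [
                image[y][x]
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ]
            tokens.append(patch)
    return tokens


# 4x4 single-channel image with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = split_into_patches(image, patch_size=2)
print(len(tokens), "tokens of length", len(tokens[0]))
```

A 4x4 image with 2x2 patches yields four tokens of four values each. A real input would produce (H/16) x (W/16) tokens, each projected to the transformer's feature size.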

The decoder in DPT reassembles the tokens output by different transformer stages into image-like representations at multiple resolutions, then uses a convolutional network to fuse them and generate the segmentation image.
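The core of turning tokens back into an image-like representation is a reshape: each token is placed at the grid position of the patch it came from, giving a channels-first feature map. The sketch below shows only that placement step under simplifying assumptions. The function name `tokens_to_feature_map` is hypothetical, and the handling of the extra readout token and the resampling to different resolutions described in the paper are omitted.

```python
def tokens_to_feature_map(tokens, grid_height, grid_width):
    """Arrange grid_height * grid_width tokens of dimension D into a
    channels-first map of shape D x grid_height x grid_width.

    Tokens are assumed to be in row-major patch order.
    Hypothetical helper for illustration only.
    """
    if len(tokens) != grid_height * grid_width:
        raise ValueError("token count does not match the grid size")
    dim = len(tokens[0])
    return [
        [
            [tokens[r * grid_width + c][d] for c in range(grid_width)]
            for r in range(grid_height)
        ]
        for d in range(dim)
    ]


# Four 2-dimensional tokens from a 2x2 patch grid.
tokens = [[10, 11], [20, 21], [30, 31], [40, 41]]
feature_map = tokens_to_feature_map(tokens, grid_height=2, grid_width=2)
print(feature_map)
```

Each channel of the result is a 2x2 grid holding one component of every token, which is the layout a convolutional decoder expects as input.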

There are three model architectures defined in DPT: ViT-Base, ViT-Large, and ViT-Hybrid. ViT-Base performs patch-based embedding and has 12 transformer layers. ViT-Large performs the same embedding as ViT-Base, but has 24 transformer layers and a larger feature size. ViT-Hybrid performs embedding using ResNet50 and has 12 transformer layers.

DPT accuracy

At the time of publication, DPT set a new state of the art for the semantic segmentation task on ADE20K, a large dataset with 150 classes.

Source: https://arxiv.org/pdf/2103.13413

After some fine-tuning, it also reached the state of the art on smaller datasets such as NYUv2, KITTI, and Pascal Context.

Source: https://arxiv.org/pdf/2103.13413

Below is a comparison of MiDaS and DPT for depth estimation. DPT is able to predict the depth in more detail. It also improves accuracy in large homogeneous regions and in the relative positioning of objects within an image, both of which are shortcomings of convolutional networks.

Source: https://arxiv.org/pdf/2103.13413

Below is a comparison for the segmentation task. DPT tends to produce more detailed output at object boundaries, and it tends to produce less cluttered output in some cases.

Source: https://arxiv.org/pdf/2103.13413


You can use the following commands to perform segmentation and depth estimation on the input images with ailia SDK.

$ python3 dense_prediction_transformers.py -i input.jpg -s output.png --task=segmentation -e 0
$ python3 dense_prediction_transformers.py -i input.jpg -s output.png --task=monodepth -e 0

Here is a result you can expect.
