MMDetection Tutorial in Kaggle: A State-of-the-Art Object Detection Library

Important Resources

Before we begin, here are some resources that I will reference or use, or that may help you understand MMDetection better.
MMDetection Github Repo
MMDetection Documentation
MMDetection Custom Dataset Tutorial
Kaggle Notebook

Installing the Required Libraries

This tutorial follows the Kaggle notebook listed above. The first step is to install the MMDetection library. In Kaggle, start a GPU notebook and check the CUDA build for the GPU you are connected to.

!nvcc -V
!gcc --version

Now find the CUDA version in the output.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO

The CUDA version is the release number reported by nvcc above (here, release 11.0).

Next, install the PyTorch and torchvision versions that correspond to the CUDA version.

!pip install -U torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f
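
The suffix after the "+" in the pip command comes from the CUDA release. As a rough sketch, the tag could be derived from the nvcc output like this. The helper name cuda_tag_from_nvcc is hypothetical and not part of any library.

```python
import re


def cuda_tag_from_nvcc(nvcc_output: str) -> str:
    """Turn the 'release X.Y' part of `nvcc -V` output into a tag such as 'cu110'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no CUDA release found in the nvcc output")
    major, minor = match.groups()
    return f"cu{major}{minor}"


# The line below is copied from the nvcc output shown above.
sample = "Cuda compilation tools, release 11.0, V11.0.221"
print(cuda_tag_from_nvcc(sample))  # cu110
```

The resulting tag should be the same for both torch and torchvision, otherwise the compiled ops of the two packages may not match.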

Now install "mmcv-full", the OpenMMLab library that MMDetection is built on. Then clone the MMDetection GitHub repository and install its requirements.

Note: This step takes around 15 minutes, so be patient. Don't worry if it appears to be stuck while building the wheel for mmcv-full.

!pip install mmcv-full
!rm -rf mmdetection
!git clone
%cd mmdetection
!pip install -e .
!pip install Pillow==7.0.0
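
After a long install it can help to confirm what actually ended up in the environment. The following is a minimal sketch, assuming Python 3.8 or newer (for importlib.metadata) and assuming the packages are registered under the names used in the pip commands above. The helper installed_versions is hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(
        ["torch", "torchvision", "mmcv-full", "mmdet"]).items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```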

Simple Demo Of MMDetection

Now that you have installed the required libraries, you can start experimenting with MMDetection. A good starting point is running inference on a demo image provided by MMDetection. I used a version of Mask R-CNN trained on the COCO dataset and downloaded its checkpoint from MMDetection. Then you use the MMDetection functions init_detector, inference_detector, and show_result_pyplot to initialize the model, run inference, and show the result on the image.

import torch, torchvision
import mmdet
from mmdet.apis import inference_detector, init_detector, show_result_pyplot
!mkdir checkpoints
!wget -c \
-O checkpoints/mask_rcnn_r50_caffe_fpn_mstrain-poly_3x_coco_bbox_mAP-0.408__segm_mAP-0.37_20200504_163245-42aa3d00.pth
config = 'configs/mask_rcnn/'
checkpoint = 'checkpoints/mask_rcnn_r50_caffe_fpn_mstrain-poly_3x_coco_bbox_mAP-0.408__segm_mAP-0.37_20200504_163245-42aa3d00.pth'
model = init_detector(config, checkpoint, device='cuda:0')
img = './demo/demo.jpg'
result = inference_detector(model, img)
show_result_pyplot(model, img, result, score_thr=0.3)
%cd ..
This is what the inference should look like.
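
If you want the detections as data rather than as a picture, you can filter the result yourself. As far as I know, in MMDetection 2.x a mask model such as Mask R-CNN returns a tuple of (bbox results, segmentation results), and the bbox results hold one array per class with rows of [x1, y1, x2, y2, score]. The sketch below assumes that layout and uses hand-written numbers in place of a real result. filter_detections is a hypothetical helper.

```python
def filter_detections(bbox_result, score_thr=0.3):
    """Keep (class_index, [x1, y1, x2, y2], score) for rows scoring above score_thr."""
    kept = []
    for class_index, rows in enumerate(bbox_result):
        for row in rows:
            x1, y1, x2, y2, score = (float(v) for v in row)
            if score > score_thr:
                kept.append((class_index, [x1, y1, x2, y2], score))
    return kept


# Made-up stand-in for the bbox half of a result: two classes, three boxes.
fake_bbox_result = [
    [[10, 20, 110, 220, 0.92], [5, 5, 50, 50, 0.12]],
    [[30, 40, 90, 160, 0.45]],
]

for detection in filter_detections(fake_bbox_result, score_thr=0.3):
    print(detection)
```

With a real result from the demo above you would pass result[0], the bbox half of the tuple, assuming the layout described here holds for your version.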

Preprocessing Dataset

The dataset you will use in this tutorial is a gun object detection dataset. It is a little tricky to deal with because its annotations are stored in txt files, so you need to preprocess it. First, you will convert the txt files to xml files (credit goes to Siddhesh Sali for the function). Start by copying all txt and image files in the dataset to a new directory. Next, iterate over the txt files and read each one. Split the contents on whitespace, convert each piece to an integer, and assign the 0th element (the number of annotations) to n. Then create an xml file with the same name as the txt file and write it in the PASCAL VOC annotation format. Read the image with cv2 to get its height and width. For the bounding boxes, iterate over each annotation in the txt file, take its xmin, ymin, xmax, and ymax, and write them to the xml file. Finally, clean up the directory by removing the txt files and moving the images into a separate directory.

import os
from mmdet.datasets import build_dataset
from mmdet.models import build_detector
from mmdet.apis import train_detector
import glob
import cv2
import shutil
import random
import os.path as osp
import json
import mmcv
import re
import xml.etree.ElementTree as ET
from typing import Dict, List
#Stolen from
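def convert_txt(source):
    """Convert every txt annotation file in `source` to a PASCAL VOC xml file."""
    for txt_file in glob.glob(source + '/*.txt'):
        # A txt file holds the number of boxes followed by
        # xmin, ymin, xmax, ymax for every box.
        with open(txt_file) as f:
            f_str = f.read()  # assumed: read the whole file

        lst = list(map(int, f_str.split()))
        n = lst[0]

        # Mode "x" raises an error if the xml file already exists.
        with open(txt_file.replace(".txt", ".xml"), "x") as fx:
            # Assumed: <annotation> is the PASCAL VOC root element.
            fx.write("<annotation>\n")
            fx.write(" <filename>{}.jpeg</filename>\n".format(txt_file.replace(source,"").replace(".txt","").replace("/","").replace("\\","")))

            # Read the image to get its height, width, and channel count.
            im = cv2.imread(txt_file.replace(".txt", ".jpeg"))
            h, w, c = im.shape
            fx.write(" <size>\n")
            fx.write("  <width>{}</width>\n".format(w))
            fx.write("  <height>{}</height>\n".format(h))
            fx.write("  <depth>{}</depth>\n".format(c))
            fx.write(" </size>\n")

            fx.write(" <segmented>0</segmented>\n")

            # Box i occupies positions 4*i+1 .. 4*i+4 of the integer list.
            for i in range(n):
                xmin = lst[(i*4)+1]
                ymin = lst[(i*4)+2]
                xmax = lst[(i*4)+3]
                ymax = lst[(i*4)+4]
                fx.write(" <object>\n")
                fx.write("  <name>Gun</name>\n")
                fx.write("  <bndbox>\n")
                fx.write("   <xmin>{}</xmin>\n".format(xmin))
                fx.write("   <ymin>{}</ymin>\n".format(ymin))
                fx.write("   <xmax>{}</xmax>\n".format(xmax))
                fx.write("   <ymax>{}</ymax>\n".format(ymax))
                fx.write("  </bndbox>\n")
                fx.write(" </object>\n")

            # Assumed: close the root element opened above.
            fx.write("</annotation>\n")
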
!mkdir /kaggle/working/xml-labels
!cp -a ../input/guns-object-detection/Images/. /kaggle/working/xml-labels
!cp -a ../input/guns-object-detection/Labels/. /kaggle/working/xml-labels
convert_txt("/kaggle/working/xml-labels")

# Remove the txt files now that the xml files exist.
for file in os.listdir('/kaggle/working/xml-labels'):
    if file[-3:] == 'txt':
        os.remove('/kaggle/working/xml-labels/' + file)

!mkdir images

# Move the images into their own directory.
for file in os.listdir('/kaggle/working/xml-labels'):
    if file[-4:] == 'jpeg':
        shutil.move('/kaggle/working/xml-labels/' + file, '/kaggle/working/images')
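
To make the txt layout described above concrete, here is a small sketch that parses one label file's text into box tuples and checks that the count matches. The sample text is made up for illustration, and parse_label_text is a hypothetical helper that is not part of the original function.

```python
def parse_label_text(text):
    """Parse 'n' followed by n groups of xmin ymin xmax ymax into a list of tuples."""
    values = [int(token) for token in text.split()]
    if not values:
        raise ValueError("empty label file")
    n = values[0]
    coords = values[1:]
    if len(coords) != 4 * n:
        raise ValueError(f"expected {4 * n} coordinates, found {len(coords)}")
    return [tuple(coords[i:i + 4]) for i in range(0, 4 * n, 4)]


# Made-up label text: two boxes.
sample = "2\n76 45 146 87\n10 12 40 60\n"
print(parse_label_text(sample))  # [(76, 45, 146, 87), (10, 12, 40, 60)]
```

Checking the count up front gives a clear error for a malformed file instead of an IndexError halfway through writing the xml.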

Train and Validation Split

This is a pretty quick part of the process :). All you need to do is create 2 new directories for the validation images and annotations and take a random sample of 30 images and annotations which you move into these new directories.

!mkdir val-xml-labels
!mkdir val-images
val_ids = random.sample(range(1, 333), 30)

for ids in val_ids:
    shutil.move('/kaggle/working/xml-labels/' + str(ids) + '.xml', '/kaggle/working/val-xml-labels')
    shutil.move('/kaggle/working/images/' + str(ids) + '.jpeg', '/kaggle/working/val-images')
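
The split above changes every time the cell is run. If you would like the same validation images on every run, one option is to draw the sample from a seeded generator. This is only a sketch: choose_val_ids and the seed value are my own additions, and the id range mirrors range(1, 333) from the code above.

```python
import random


def choose_val_ids(first_id, last_id, n_val, seed=0):
    """Pick n_val distinct ids from first_id..last_id (inclusive), reproducibly."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    return sorted(rng.sample(range(first_id, last_id + 1), n_val))


demo_val_ids = choose_val_ids(1, 332, 30, seed=0)
print(len(demo_val_ids), demo_val_ids[:5])
```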

Convert from Pascal VOC to COCO

Next, you will convert the annotation format from Pascal VOC to COCO, since it's easier to work with COCO annotations in MMDetection. You can use Pascal VOC, but I found COCO easier. To start, write a labels.txt file containing the labels, which here is just Gun, and write one file for training and one for validation that list the paths of all the xml files in each directory. Then use a modified version of a script by yukkyo to convert the Pascal VOC annotations to COCO annotations.
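
The main difference between the two formats is how a box is stored: Pascal VOC keeps the two corners (xmin, ymin, xmax, ymax), while COCO keeps the top-left corner plus width and height. The sketch below follows the same convention as the conversion script in this section, which shifts xmin and ymin by one. voc_to_coco_bbox is a hypothetical helper for illustration only.

```python
def voc_to_coco_bbox(xmin, ymin, xmax, ymax):
    """Convert Pascal VOC corners to a COCO [x, y, width, height] box."""
    x = xmin - 1  # same one-pixel shift as the conversion script
    y = ymin - 1
    width = xmax - x
    height = ymax - y
    if width <= 0 or height <= 0:
        raise ValueError(f"box has no area: {(xmin, ymin, xmax, ymax)}")
    return [x, y, width, height]


print(voc_to_coco_bbox(10, 20, 50, 80))  # [9, 19, 41, 61]
```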

%%writefile labels.txt
Gun

#Put the above code in its own separate cell.
# Write one xml path per line for the training set.
lines = []
for file in os.listdir('/kaggle/working/xml-labels'):
    lines.append('/kaggle/working/xml-labels/' + file)
with open('train.txt', 'w') as f:
    for line in lines:
        f.write(line + '\n')  # assumed: one path per line

# Same for the validation set.
lines = []
for file in os.listdir('/kaggle/working/val-xml-labels'):
    lines.append('/kaggle/working/val-xml-labels/' + file)
with open('val.txt', 'w') as f:
    for line in lines:
        f.write(line + '\n')  # assumed: one path per line

#Stolen from
def get_label2id(labels_path: str) -> Dict[str, int]:
    # Label ids start at 1.
    with open(labels_path, 'r') as f:
        labels_str = f.read().split()  # assumed: whitespace-separated labels
    labels_ids = list(range(1, len(labels_str)+1))
    return dict(zip(labels_str, labels_ids))

def get_annpaths(ann_dir_path: str = None,
                 ann_ids_path: str = None,
                 ext: str = '',
                 annpaths_list_path: str = None) -> List[str]:
    # If use annotation paths list
    if annpaths_list_path is not None:
        with open(annpaths_list_path, 'r') as f:
            ann_paths = f.read().split()  # assumed: one path per line
        return ann_paths

    # If use annotation ids list
    ext_with_dot = '.' + ext if ext != '' else ''
    with open(ann_ids_path, 'r') as f:
        ann_ids = f.read().split()  # assumed: one id per line
    ann_paths = [os.path.join(ann_dir_path, aid+ext_with_dot) for aid in ann_ids]
    return ann_paths

def get_image_info(annotation_root, extract_num_from_imgid=True):
    # Prefer <path> if the xml has one, otherwise fall back to <filename>.
    path = annotation_root.findtext('path')
    if path is None:
        filename = annotation_root.findtext('filename')
    else:
        filename = os.path.basename(path)
    img_name = os.path.basename(filename)
    img_id = os.path.splitext(img_name)[0]
    if extract_num_from_imgid and isinstance(img_id, str):
        # Use the first run of digits in the file name as the image id.
        img_id = int(re.findall(r'\d+', img_id)[0])

    size = annotation_root.find('size')
    width = int(size.findtext('width'))
    height = int(size.findtext('height'))

    image_info = {
        'id': img_id,
        'width': width,
        'height': height,
        'file_name': filename,
    }
    return image_info

def get_coco_annotation_from_obj(obj, label2id):
    label = obj.findtext('name')
    # assert label in label2id, f"Error: {label} is not in label2id !"
    category_id = label2id[label]
    bndbox = obj.find('bndbox')
    xmin = int(float(bndbox.findtext('xmin'))) - 1
    ymin = int(float(bndbox.findtext('ymin'))) - 1
    xmax = int(float(bndbox.findtext('xmax')))
    ymax = int(float(bndbox.findtext('ymax')))
    assert xmax > xmin and ymax > ymin, f"Box size error !: (xmin, ymin, xmax, ymax): {xmin, ymin, xmax, ymax}"
    o_width = xmax - xmin
    o_height = ymax - ymin
    ann = {
        'category_id': category_id,
        'segmentation': [],  # This script is not for segmentation
        'area': o_width * o_height,
        'bbox': [xmin, ymin, o_width, o_height],  # COCO order: x, y, width, height
        'iscrowd': 0,
    }
    return ann

def convert_xmls_to_cocojson(annotation_paths: List[str],
                             label2id: Dict[str, int],
                             output_jsonpath: str,
                             extract_num_from_imgid: bool = True):
    output_json_dict = {
        "images": [],
        "annotations": [],
        "categories": []
    }
    bnd_id = 1  # running annotation id
    print('Start converting !')
    for a_path in annotation_paths:
        # Read annotation xml
        ann_tree = ET.parse(a_path)
        ann_root = ann_tree.getroot()

        img_info = get_image_info(annotation_root=ann_root,
                                  extract_num_from_imgid=extract_num_from_imgid)  # assumed argument
        img_id = img_info['id']
        output_json_dict['images'].append(img_info)  # assumed

        for obj in ann_root.findall('object'):
            ann = get_coco_annotation_from_obj(obj=obj, label2id=label2id)
            annot = {'id': bnd_id, 'image_id': img_id,}
            ann.update(annot)  # assumed
            output_json_dict['annotations'].append(ann)  # assumed
            bnd_id = bnd_id + 1

    for label, label_id in label2id.items():
        category_info = {'id': label_id, 'name': label, 'supercategory': 'none'}
        output_json_dict['categories'].append(category_info)  # assumed

    with open(output_jsonpath, 'w') as f:
        output_json = json.dumps(output_json_dict)
        f.write(output_json)  # assumed

def convert_to_coco(ann_path_list='/kaggle/working/train.txt', labels='/kaggle/working/labels.txt', output='/kaggle/working/output.json'):
    label2id = get_label2id(labels_path=labels)
    ann_paths = get_annpaths(annpaths_list_path=ann_path_list)  # assumed argument
    convert_xmls_to_cocojson(annotation_paths=ann_paths,  # assumed call
                             label2id=label2id,
                             output_jsonpath=output)

convert_to_coco()  # assumed: training set, default arguments
convert_to_coco(ann_path_list='/kaggle/working/val.txt', labels='/kaggle/working/labels.txt', output='/kaggle/working/val_output.json')
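
Before training, it may be worth checking that the converted annotations are consistent with each other. The sketch below is my own addition rather than part of the conversion script: summarize_coco is a hypothetical helper that counts the entries and reports annotations pointing at unknown images or categories, or boxes without area. It runs here on a tiny hand-written dictionary.

```python
def summarize_coco(coco):
    """Count entries in a COCO-style dict and collect simple consistency problems."""
    image_ids = {img["id"] for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    problems = []
    for ann in coco["annotations"]:
        if ann["image_id"] not in image_ids:
            problems.append(f"annotation {ann['id']}: unknown image {ann['image_id']}")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']}: unknown category {ann['category_id']}")
        _, _, width, height = ann["bbox"]
        if width <= 0 or height <= 0:
            problems.append(f"annotation {ann['id']}: empty box {ann['bbox']}")
    return {
        "images": len(coco["images"]),
        "annotations": len(coco["annotations"]),
        "categories": len(coco["categories"]),
        "problems": problems,
    }


# Made-up example with one image, one category, and one good annotation.
demo_coco = {
    "images": [{"id": 1, "width": 640, "height": 480, "file_name": "1.jpeg"}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [9, 19, 41, 61], "area": 41 * 61,
                     "iscrowd": 0, "segmentation": []}],
    "categories": [{"id": 1, "name": "Gun", "supercategory": "none"}],
}
print(summarize_coco(demo_coco))
```

On the real files you would load the dictionary with json.load from /kaggle/working/output.json or /kaggle/working/val_output.json and pass it in the same way.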

Create Model

Now that you have preprocessed your data, you are ready to create the model. You do this by creating an MMDetection config file. MMDetection config files are inheritable files that contain all the information about a model, from its backbone to its loss and even its data pipeline. There is a config file for each model in the MMDetection model zoo. You can browse the configs available for each model in the configs directory of the MMDetection repository. When you click on a model, you should see a README that looks like this.
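
To illustrate what "inheritable" means here, the sketch below merges a child config into a base config so that the child only has to state what differs. This is a simplified illustration with plain dictionaries, not MMCV's actual implementation, which has more features. merge_config and the example values are my own.

```python
import copy


def merge_config(base, override):
    """Return a copy of base with override applied, merging nested dicts key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative base and child configs (values chosen for the example).
base = {
    "model": {"type": "RetinaNet",
              "bbox_head": {"num_classes": 80, "in_channels": 256}},
    "optimizer": {"type": "SGD", "lr": 0.01},
}
child = {
    "model": {"bbox_head": {"num_classes": 1}},
    "optimizer": {"lr": 0.00125},
}

cfg_demo = merge_config(base, child)
print(cfg_demo["model"]["bbox_head"])  # {'num_classes': 1, 'in_channels': 256}
```

Everything the child does not mention, such as in_channels or the optimizer type, is carried over from the base unchanged.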

ReadMe for RetinaNet shown.

I chose to use RetinaNet with a ResNet-101 backbone. You can choose any model here, but you might need to do the next step a little differently (check whether the model has a roi_head and, if it does, change the number of classes there). For this tutorial, I will show how to use RetinaNet. You need to download the checkpoint for the specific model you want to use; it is the model link under the Download column in the README. Once you have downloaded the checkpoint, upload it to Kaggle as a dataset and add that dataset to your notebook. Now you will create the config file. You inherit the config of the chosen model and make some changes: the dataset settings, the number of classes in the bbox head, the checkpoint filepath, the learning rate, the learning rate steps, the evaluation interval, and the seed.
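
The roi_head check mentioned above can be sketched with plain dictionaries. The layouts below are simplified stand-ins for a single-stage model (bbox_head at the top level) and a two-stage model (heads nested under roi_head); real configs contain many more keys, and set_num_classes is a hypothetical helper rather than an MMDetection function.

```python
def set_num_classes(model_cfg, num_classes):
    """Set num_classes on every head that has it and return the paths that changed."""
    changed = []

    def update(head, path):
        if isinstance(head, dict) and "num_classes" in head:
            head["num_classes"] = num_classes
            changed.append(path)
        elif isinstance(head, (list, tuple)):
            # Some models keep several heads in a list.
            for index, sub_head in enumerate(head):
                update(sub_head, f"{path}[{index}]")

    update(model_cfg.get("bbox_head"), "bbox_head")
    roi_head = model_cfg.get("roi_head") or {}
    for name in ("bbox_head", "mask_head"):
        update(roi_head.get(name), f"roi_head.{name}")
    return changed


# Simplified stand-ins for two kinds of model config.
single_stage = {"type": "RetinaNet", "bbox_head": {"num_classes": 80}}
two_stage = {"type": "FasterRCNN",
             "roi_head": {"bbox_head": {"num_classes": 80}}}

print(set_num_classes(single_stage, 1))  # ['bbox_head']
print(set_num_classes(two_stage, 1))     # ['roi_head.bbox_head']
```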

from mmcv import Config
cfg = Config.fromfile('/kaggle/working/mmdetection/configs/retinanet/')
from mmdet.apis import set_random_seed

cfg.dataset_type = 'CocoDataset'
cfg.classes = '/kaggle/working/labels.txt'
cfg.data_root = '/kaggle/working'
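# The attribute names in this block are reconstructed from the values and
# follow the MMDetection 2.x dataset config keys.
cfg.model.bbox_head.num_classes = 1  # assumed: one class, Gun

cfg.data.test.type = 'CocoDataset'
cfg.data.test.classes = 'labels.txt'
cfg.data.test.data_root = '/kaggle/working'
cfg.data.test.ann_file = 'val_output.json'
cfg.data.test.img_prefix = 'val-images'

cfg.data.train.type = 'CocoDataset'
cfg.data.train.data_root = '/kaggle/working'
cfg.data.train.ann_file = 'output.json'
cfg.data.train.img_prefix = 'images'
cfg.data.train.classes = 'labels.txt'

cfg.data.val.type = 'CocoDataset'
cfg.data.val.data_root = '/kaggle/working'
cfg.data.val.ann_file = 'val_output.json'
cfg.data.val.img_prefix = 'val-images'
cfg.data.val.classes = 'labels.txt'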

cfg.load_from = '../input/retinanet/retinanet_r101_fpn_1x_coco_20200130-7a93545f (1).pth'

cfg.work_dir = './'

# The original learning rate (LR) is set for 8-GPU training.
# You divide it by 8 since you only use one GPU with Kaggle.
cfg.optimizer.lr = 0.01 / 8
cfg.optimizer_config.grad_clip = dict(max_norm=35, norm_type=2)

cfg.lr_config.policy = 'step'
cfg.lr_config.step = 7
# Two further settings, each assigned the value 1, belong here; their names are missing.
cfg.evaluation.metric = 'bbox'
cfg.evaluation.interval = 4
cfg.checkpoint_config.interval = 12
cfg.log_config.interval = 100
cfg.runner.max_epochs = 24

cfg.seed = 0
set_random_seed(0, deterministic=False)
cfg.gpu_ids = range(1)
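
The learning rate settings above can be worked through by hand. The sketch below scales the 8-GPU learning rate down to one GPU and then applies a step decay. It reflects my understanding that MMCV's step policy, when given a single integer, multiplies the rate by gamma (0.1 by default) every that many epochs. Warmup is ignored, and both function names are hypothetical.

```python
def scaled_lr(base_lr, base_gpus, gpus):
    """Scale a learning rate linearly with the number of GPUs."""
    return base_lr * gpus / base_gpus


def step_lr(base_lr, epoch, step, gamma=0.1):
    """Learning rate at a given epoch under a step policy (no warmup)."""
    if isinstance(step, int):
        exponent = epoch // step  # decay every `step` epochs
    else:
        exponent = sum(1 for milestone in step if epoch >= milestone)
    return base_lr * gamma ** exponent


lr = scaled_lr(0.01, base_gpus=8, gpus=1)
for epoch in (0, 6, 7, 14, 23):
    print(epoch, step_lr(lr, epoch, step=7))
```

Under these assumptions the rate drops by a factor of ten at epochs 7, 14, and 21 of the 24 training epochs.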


Train Model

You are finally ready to train the model. All you need to do is build the dataset, build the model, set model.CLASSES to the classes for visualizations while inferencing, create the work directory, and train the model.

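# Build the training dataset
datasets = [build_dataset(cfg.data.train)]

# Build the detector
model = build_detector(
    cfg.model, train_cfg=cfg.get('train_cfg'), test_cfg=cfg.get('test_cfg'))
# Add the class names for visualization
model.CLASSES = datasets[0].CLASSES

# Create the work directory (assumed: this call is missing from the listing)
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
train_detector(model, datasets, cfg, distributed=False, validate=True)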


With training finished, you can run inference on some images. I chose the validation dataset created earlier. MMDetection makes it very quick to visualize inferences: set the model config to the config created earlier, run inference on each image, then show the result.

model.cfg = cfg

for i in range(len(val_ids)):
    img = mmcv.imread('/kaggle/working/val-images/' + str(val_ids[i]) + '.jpeg')
    result = inference_detector(model, img)
    show_result_pyplot(model, img, result)

Example of a visualized inference

And that’s it! You have just finished your first MMDetection object detection model. The nice thing about MMDetection is that it is relatively easy to switch models, so try using a different model such as Faster R-CNN as a quick exercise! Thanks for reading!

