Enhancing Security Measures through Clothes Detection



1. The Big Picture

To start off, building such a complex system requires a careful interplay between many working parts, each contributing to the whole.

First, we build modules that operate on the whole image. For example, to begin extracting all the interesting information, we need a person detector module that detects people in a given image or video frame. On top of that module, we will also have a robust tracker to keep track of individuals throughout the video across timesteps, or frames.

Second, we need modules that act on the detected people. For example, we will have a gender classification model whose role is, given a detected person from the video frame, to identify that person’s gender. Another, much more complex example is a clothes detection model that, given a detected person from the video frame, detects and identifies the various clothing types and colors that person is wearing. Logging all of this data and storing it in a highly structured, easily accessible database is yet another component of this system.
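The per-frame flow described above can be sketched as follows. Everything here is a hypothetical stand-in for the real modules: the `Detection` record, the function names, and the callable interfaces are assumptions made for illustration, not our actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Detection:
    # Hypothetical record logged to the database for each tracked person.
    track_id: int                    # stable ID assigned by the tracker
    box: tuple                       # (x1, y1, x2, y2) person bounding box
    gender: Optional[str] = None     # filled in by the gender classifier
    clothes: list = field(default_factory=list)  # (item, color) pairs

def process_frame(frame, detector, tracker, gender_model, clothes_model):
    """Run the whole-image modules first, then the per-person modules."""
    people = detector(frame)          # boxes of people found in the frame
    tracked = tracker(people)         # {track_id: box}, IDs stable over time
    records = []
    for track_id, box in tracked.items():
        crop = frame  # in practice: crop the frame to `box` first
        records.append(Detection(
            track_id=track_id,
            box=box,
            gender=gender_model(crop),
            clothes=clothes_model(crop),
        ))
    return records                    # ready to be logged to the database
```

The key design point this sketch captures is the two-stage split: whole-image modules (detector, tracker) run once per frame, while per-person modules (gender, clothes) run once per detected person.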

Each of these components deserves its own blog post, and in this one, we’ll be focusing on the clothes detection module!

2. Data and Labelling

Fashion-related datasets have always been abundant, from the Fashion MNIST dataset to the huge sea of web images filled with photos of people posing in various fashion items. Nevertheless, the lack of proper detection datasets was a real challenge for us: we don’t just want to feed a cropped image of an item to a model and have it guess its type, or class. Our goal is a model that takes an image as input and detects the various fashion objects in it, outputting each object’s class and its location in the image.

In order to build such a model, we had to manually build and label our custom dataset. The images we settled on were mostly drawn from the StreetStyle27K dataset, which was impressively diverse and representative: it included a plethora of clothes types and posing scenarios, and a great balance of people’s gender, age, and ethnicity. One caveat was that the labels provided with StreetStyle27K were not very useful for us, as they included only a single annotation per image, even when the image contained multiple people. On top of that, these single annotations were oddly centered around people’s heads, with plenty of empty space included in the bounding box.

It is for this reason that we took on the challenge of labelling the images ourselves, and we did that pretty cleverly. After several discussions and iterations, we settled on the following clothes types, or classes:

  • T-Shirt (short sleeve)
  • Sweater (long sleeve)
  • Tank Top (no sleeve)
  • Shirt (formal shirt)
  • Suit (formal jacket/blazer)
  • Outerwear (jacket/coat/etc.)
  • Dress
  • Pants (formal pants/jeans/etc.)
  • Shorts
  • Skirt
  • Hat (beanie/cap/etc.)
  • Hijab
  • Abaya
  • Scarf
  • Glasses *
  • Sunglasses *

* Glasses and Sunglasses have no colors associated with them.

And on the following color classes:

  • Black
  • White
  • Grey
  • Blue
  • Yellow
  • Green
  • Red
  • Violet (Purple)
  • Orange
  • Pink
  • Brown
  • Beige
  • Multicolor *

* Multicolor is anything with an unclear primary color. Even if a shirt is black and white, we label it as multicolor, as one person could say it’s white, and another could argue that it’s black.

The final set of labels was composed of the combination of clothes and color types, having the following format: {clothing_class}_{color_class}.

This led to a total of 185 classes! The goal was to make the dataset as versatile as possible, both in how we train on it and in how we could repurpose it for other detection tasks. Let’s take a look at how we utilized the data in the next section!
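Generating that combined label set can be sketched as below. The class identifiers are hypothetical spellings of the lists above, and the pairing rule (every clothing class crosses every color, except Glasses and Sunglasses, which carry no color) is our reading of the footnotes; the exact final count depends on details of the schema not fully spelled out here.

```python
# Hypothetical identifiers for the clothes classes that take a color.
clothing_classes = [
    "tshirt", "sweater", "tanktop", "shirt", "suit", "outerwear",
    "dress", "pants", "shorts", "skirt", "hat", "hijab", "abaya", "scarf",
]
colorless_classes = ["glasses", "sunglasses"]  # no color variants (see *)
color_classes = [
    "black", "white", "grey", "blue", "yellow", "green", "red",
    "violet", "orange", "pink", "brown", "beige", "multicolor",
]

# {clothing_class}_{color_class} for every combination, plus the two
# colorless classes on their own.
labels = [f"{c}_{col}" for c in clothing_classes for col in color_classes]
labels += colorless_classes
```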

3. Building and Training the Model

Before training our model, we had to decide how to feed in our data. Thanks to our flexible annotation schema, we had plenty of options to choose from at this stage! For example, we could feed in the data as is and try to detect 185 distinct classes. While this is definitely on our checklist, it would require a really (like really) huge dataset in order to produce promising results.

So, while working on building this huge dataset, we decided to test out our model on a subset of 1,000 labelled images containing a total of more than 8,000 objects. For this test, we merged our labels by their clothes type: for instance, “pants_black”, “pants_blue”, …, “pants_multicolor” would all be treated as the same “pants” object. This merging greatly improved class balance, because in the original set of classes, labels such as “pants_blue” or “shirt_white” were much more common than labels such as “pants_multicolor” or “shirt_yellow”. All in all, this merging step produced a total of only 14 classes.

Keep in mind that having such a versatile labelling schema will allow us, further down the road, to try various mergings and combinations of labels. We could, for instance, decide to merge all the “upperwear” classes into a single class and the “bottomwear” classes into another. Or we might even decide to combine the classes by their color types (we actually did this, and got some really interesting results to showcase!).

Now that our data is ready, we’re ready to tackle our model building and training!

For this model, we decided to start with transfer learning, using a ResNet34 backbone architecture coupled with Faster RCNN. This combination turned out to have the best trade-off between speed and accuracy for the current subset of data. After defining several hyper-parameters and adding a couple of subtle data augmentations (e.g., image rotation and contrast shifting), we trained the model for 72 epochs, which took around 8 hours to complete.

The numerical results were acceptable, settling at around 50–60 AP (Average Precision) for most classes. However, pure numbers often don’t convey the whole story, especially in object detection tasks. So we took a look at some inferences produced from the testing dataset, and the results were… amazingly impressive!

4. Results!

So without further ado, here are some of the testing results!

This image is particularly impressive! We can see the model detecting clothes objects on both large- and small-scale people! Notice the misclassified shoes in the lower part of the image, confused with a small-scale pants object. Nevertheless, the confidence of this detection is around 63%, so it can easily be disregarded. While there are a couple of missed instances, remember that this is our alpha version of the model, trained on only 1,000 images! We really can’t wait to see the results we get by training a model on tens of thousands of images!
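Disregarding low-confidence detections like that 63% shoes-as-pants prediction is a simple score threshold. A minimal sketch over a generic detection output (the 0.7 threshold and the tuple layout are illustrative choices, not values from our system):

```python
def filter_detections(detections, threshold=0.7):
    """Keep only detections whose confidence score meets the threshold.

    `detections` is a list of (label, score, box) tuples, mirroring the
    per-image output of a typical object detection model.
    """
    return [d for d in detections if d[1] >= threshold]
```

In practice the threshold is a tunable precision/recall knob: raise it to drop spurious detections like the one above, lower it to catch more true instances.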

In this image, we can see a close-to-perfect detection! The exceptions are the sunglasses on the rightmost person, which are almost hidden by his hair, and the shorts on the man to his left, which are confused with a skirt due to their odd-looking texture.

Moving on to this inference, it yields pretty interesting results! Given that this outerwear object is not physically “linked” through contiguous pixels, the model effectively segmented it into three distinct parts: first the hat, then the two separate left and right sections of the outerwear. This inference made us re-evaluate what is considered a “good” detection, and it further conveys the point we made earlier: pure numbers really don’t tell the whole story!

And here are a couple more inferences produced by our model!

Finally, take a look at some of the results we’ve got when merging the classes according to their color! The model’s new role here would be to detect “any type of clothes that are red”, or “any type of clothes that are white”, etc.

5. Next Steps!

Having built such an impressive proof of concept with merely 1,000 images, our expectations for the future versions of this model are set pretty high!

In the foreseeable future, we expect to aggregate enough data to feed it directly into the model, which should make detecting the 185 distinct classes possible!


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot
