Wait For Labeled Data or Not? This is the Question

Human key-points estimation is a natural way to represent humans in the digital visual domain and is one of the fundamental tasks in computer vision. It has a wide variety of computer vision applications such as action recognition, pose estimation, animation, tracking, etc. However, it poses a significant engineering challenge due to the possible variations in occlusion, scale, illumination, scene type, and more.

Human Key-points Marking Example

Developing a supervised human pose estimation model on your own dataset and for your own needs requires both relatively massive data collection and cumbersome annotation work. The main trade-off is the amount of annotated data versus cost and time-to-market. At this point, you have two basic options: either wait for a large amount of data, which leads to a long development cycle and lets you find mistakes only at its end, or improve your models iteratively with more techniques, which leaves you more prone to overfitting that you will also have to address. We chose the second option.

A short glimpse of model results

In this post, I’ll share the main topics that helped us tackle those challenges, from both the data and algorithmic aspects. I’ll review the topics in the order in which we developed the model, starting from the problem statement and progressing through the algorithmic improvements.

  • Problem Statement
  • Evaluation Metrics
  • Architecture
  • Data Collection
  • Algorithmic Improvements

Problem Statement

Before we continue, let me tell you about Nanit a bit. Our products and services create a support system for parents, getting them closer to their child’s development than ever before. As part of the support system for parents, Nanit provides height measurement via the Smart-Sheets product and a better understanding of a child’s physical developmental milestones.

We at Nanit found the skeleton model described above very useful for our case as well, and decided to set the following goals:

  1. Develop a skeleton model for the height measurement problem
  2. Develop a skeleton model for Nanit’s diversely distributed data to serve multiple purposes

Evaluation Metrics

Quick time to market is a critical aspect of any company’s success, and finding a short development path to a quality product helps Nanit to achieve its company goals. But how does one define the quality of a product?
Defining evaluation metrics from the beginning helps one stay focused and verify that each step leads us towards the project goals. With the project goals in the back of our mind, we asked ourselves how we can evaluate our progress during the project, from the user side, and how to “translate” it to the skeleton model. That led us to use the following evaluation metric.

Height Error

This metric tests the end use of the skeleton model in height measurement. The height measurement algorithm is derived from a set of human key-points. We aimed to test the user experience and accuracy of height measurement using our skeleton model by evaluating height accuracy with standard metrics on a dedicated test set.


After we’ve defined the user metric we can go one step further and define the skeleton model evaluation metric. The challenge is to find a model metric with a high correlation to the user experience. Finding such a metric means that better model evaluation will lead to a better user experience.

The PCKh metric evaluates the performance of the skeleton model itself. PCKh is a head-normalized version of Percent Correct Key-points, as defined for the MPII dataset (one of the common datasets for human pose estimation). A joint is counted as correct if it lies within a radius of α·l pixels from the ground-truth (GT) location. ‘α’ is a constant threshold, e.g. α=0.1 for PCKh@0.1, and ‘l’ is a reference length derived from the head bounding box.
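
To make the definition concrete, here is a minimal NumPy sketch of the metric. The function name, the array layout, and the optional visibility mask are assumptions for illustration; how ‘l’ is derived from the head bounding box is left to the caller.

```python
import numpy as np

def pckh(pred, gt, head_size, alpha=0.1, visible=None):
    """Fraction of joints within alpha * head_size of the ground truth.

    pred, gt:   (N, J, 2) arrays of (x, y) joint locations in pixels.
    head_size:  (N,) array, the per-image reference length l.
    visible:    optional (N, J) boolean mask of joints to score (assumption).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    head_size = np.asarray(head_size, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)        # (N, J) pixel distances
    correct = dist <= alpha * head_size[:, None]     # per-joint hit or miss
    if visible is None:
        visible = np.ones_like(correct, dtype=bool)
    visible = np.asarray(visible, dtype=bool)
    if not visible.any():
        return float("nan")                          # nothing to score
    return float(correct[visible].mean())
```

With a reference length of 20 pixels and α=0.1, a joint must land within 2 pixels of its GT location to count as correct.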

It is worth noting that OKS (Object Key-Point Similarity) is commonly used with the COCO key-point dataset. However, we decided not to use it; instead, we followed the MPII metric and used PCKh@0.1 for skeleton model evaluation. We chose the strict threshold α=0.1, rather than the more commonly reported 0.5, because height measurement is highly sensitive to key-point localization errors.


Architecture

First, we searched for architectures with good performance that were trained and evaluated on leading datasets such as MPII and COCO. Our case requires accurate localization, therefore we looked for an architecture that maintains the high-resolution information (see diagram below) necessary to achieve this.

We chose HRNet, a leading top-down human pose estimation architecture (check out PapersWithCode) based on heat-map regression. A top-down architecture first localizes the human and then estimates the joint locations.

HRNet Architecture

The loss function of this model is the Mean Squared Error between the predicted heat-map and the GT heat-map, computed for visible joints only. The GT heat-map is a Gaussian distribution centered on the joint’s GT location.
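
As a rough illustration of this loss, the sketch below renders a Gaussian GT heat-map and computes an MSE that ignores invisible joints. The function names, the 64×64 size default, sigma=2, and the normalization by the number of visible joints are assumptions for the example, not necessarily what the HRNet code or our training pipeline does.

```python
import numpy as np

def gaussian_heatmap(center_xy, size=64, sigma=2.0):
    """Render a size x size heat-map with a Gaussian peak at center_xy."""
    xs = np.arange(size, dtype=float)            # column coordinates
    ys = np.arange(size, dtype=float)[:, None]   # row coordinates
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def visible_mse(pred, target, visible):
    """MSE over heat-map pixels, counting only visible joints.

    pred, target: (J, H, W) arrays; visible: (J,) array of 0/1 flags.
    """
    visible = np.asarray(visible, dtype=float)
    per_joint = ((pred - target) ** 2).mean(axis=(1, 2))   # (J,)
    n_visible = visible.sum()
    if n_visible == 0:
        return 0.0                               # no visible joint, no loss
    return float((per_joint * visible).sum() / n_visible)
```

An occluded joint contributes nothing to the loss, so the model is not penalized for a heat-map it cannot be expected to predict.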

For our height measurement task, we chose to start with a pre-trained HRNet model based on MPII due to the high overlap of joints between our task and the MPII dataset. To use the skeleton model for Nanit’s height measurement, we needed to add additional joints. In practical terms, this means adding more heat-maps as output channels to the model. We found that initializing those new joints from the nearest joints in the pre-trained model worked best.
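
One way to picture this initialization is as appending rows to the final convolution’s weight tensor, each copied from an existing joint’s filter. The sketch below assumes a final layer with weights of shape (J, C, k, k) held as NumPy arrays; the function name and the index mapping passed in are hypothetical.

```python
import numpy as np

def extend_head(weight, bias, source_joint_for_new):
    """Append output channels for new joints, copied from existing ones.

    weight: (J, C, k, k) final conv kernel; bias: (J,).
    source_joint_for_new: for each new joint, the index of the existing
    joint whose filter it should start from (a hand-picked mapping).
    """
    src = np.asarray(source_joint_for_new, dtype=int)
    new_weight = np.concatenate([weight, weight[src]], axis=0)
    new_bias = np.concatenate([bias, bias[src]], axis=0)
    return new_weight, new_bias
```

Right after this step, each new heat-map channel predicts exactly what its source joint predicts, and fine-tuning then pulls the two apart.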

We also noticed that in typical human-centric images people usually have their heads up 🙂 We took advantage of the top-down approach to keep the head in the upper part of the image: during model training, we added a pre-processing phase that rotates the image so that the head stays in the upper part of the image.

Data Collection

Skeleton annotation is a long and demanding task. It is challenging to scale and to develop a robust model within a short time-to-market, within cost constraints, and with a limited amount of data. Data annotation is a pain point in all data-driven projects. Therefore, data management should be done carefully, alongside a verification process that ensures one is on the right path towards one’s goals.

The main challenges for a skeleton model are its sensitivity to human postures and its ability to learn with occlusions. Due to the complexity of data annotation and the challenges mentioned above, we decided on an iterative approach. Data collection is an expensive process in terms of both time and money. An iterative process helped us stay focused and evaluate our model continuously, so that we kept moving steadily towards the project’s goals without spending too much time and money. It also reduced the annotation noise level through rapid feedback to the annotation team.

We denote by ⅅ the desired amount of annotated data. For the Proof-Of-Concept, we marked 10% of ⅅ. After we had successfully finished the POC, we focused on making the skeleton model work for Nanit’s height measurement. The data collection aimed to meet the requirements of this goal, and that dictated our desired data distribution in terms of human posture, occlusions, sensor type, scene, etc. The first iteration consisted of an additional 40% of ⅅ, and focused on some of the postures but with a low probability for occlusions.

In each iteration, after training on a new data chunk, we evaluated the model’s performance according to the aforementioned metrics. The outcome of the evaluation should be a specification of the kind of data the model needs next in order to improve with respect to the project’s goals and metrics.
Additionally, coding, literature surveys, and finding ways to improve the model under limited-data constraints are key to finishing the project on time. They should be done continuously during the data collection iterations.

Once we had a working model that we could rely on, we had a good baseline from which to tackle the holistic pose estimation task, “Nanit in the Wild”, with the addition of the remaining 50% of ⅅ, again in an iterative way.

Algorithmic Improvements

The algorithmic improvements below are presented in the order in which the experiments were conducted. We started with common methods, such as data augmentation and a layer-freezing policy, and moved on to more advanced methods, such as a different way to represent the skeleton’s joints, to achieve better model performance.

Data Augmentations

When dealing with limited data for training, data augmentation is a fundamental approach to improve model performance and reduce overfitting. While building an augmentation mechanism, it is important to keep the augmented data close to the original problem distribution. A sequence of augmentations might yield unrealistic images that don’t represent the distribution of the problem we are aiming to solve, and will therefore result in lower performance on real data.

We came up with a mechanism that uses a few types of augmentations:

  1. Geometric Based
  2. Pixel Based
  3. Noise Based

While it makes sense to apply many geometric augmentations in sequence, applying many pixel- or noise-based augmentations produces unrealistic images. To address this, our mechanism limits the augmentations applied according to their type.
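
A mechanism of this kind can be sketched as sampling operations subject to a cap per augmentation type. The augmentation names, the caps, and the function name below are all hypothetical and only illustrate the idea; they are not our actual policy.

```python
import random

# Hypothetical augmentation names grouped by type; the caps are illustrative.
AUGMENTATIONS = {
    "geometric": ["rotate", "scale", "flip", "translate"],
    "pixel": ["brightness", "contrast", "hue"],
    "noise": ["gaussian_noise", "blur"],
}
MAX_PER_TYPE = {"geometric": 3, "pixel": 1, "noise": 1}

def sample_policy(n_ops, rng=random):
    """Pick up to n_ops distinct augmentations without exceeding any type cap."""
    pool = [(kind, name) for kind, names in AUGMENTATIONS.items() for name in names]
    rng.shuffle(pool)                      # random order, then greedy selection
    used = {kind: 0 for kind in AUGMENTATIONS}
    chosen = []
    for kind, name in pool:
        if len(chosen) == n_ops:
            break
        if used[kind] < MAX_PER_TYPE[kind]:
            used[kind] += 1
            chosen.append((kind, name))
    return chosen
```

Calling `sample_policy(4, random.Random(0))` returns four (type, name) pairs with at most one pixel-based and one noise-based operation among them.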

Choosing which augmentations to use involves some trial and error. We found that RandAugment helped us choose augmentations for skeleton model training and reduce overfitting (see the Development Timeline section).

Test Time Augmentations

Test Time Augmentation (TTA) is a powerful mechanism that applies data augmentation at test time to improve model performance.
Following the HRNet paper, we chose to use TTA for our model. The basic augmentation flips the input image horizontally and computes the joint locations from the average of the two outputs.
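
The flip-based TTA can be sketched as follows, assuming a `model` callable that maps an (H, W, C) image to (J, h, w) heat-maps and a list of left/right joint index pairs. Both are placeholders for illustration. After mirroring, the heat-maps must be flipped back and the left/right channels swapped before averaging.

```python
import numpy as np

def flip_tta(model, image, flip_pairs):
    """Average heat-maps from the image and its horizontal mirror.

    model: callable mapping an (H, W, C) image to (J, h, w) heat-maps.
    flip_pairs: (left, right) joint index pairs to swap after mirroring.
    """
    heatmaps = model(image)
    mirrored = model(image[:, ::-1])          # flip the width axis of the input
    mirrored = mirrored[:, :, ::-1].copy()    # flip the heat-maps back
    for left, right in flip_pairs:
        # A left joint in the mirrored image is a right joint in the original.
        mirrored[[left, right]] = mirrored[[right, left]]
    return 0.5 * (heatmaps + mirrored)
```

The cost is one extra forward pass per image, which is the inference-time side of the trade-off.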

Additionally, we examined more TTAs including rotation of input image by a few predefined angles (e.g. ±15 deg). The trade-off of TTA is the additional inference time vs improved accuracy.

We chose to continue with the basic TTA due to inference time constraints.

Joints Weights

Our goal is to provide a skeleton model for height measurement. For this use, the error is strongly coupled with the baby’s ‘edge’ joints (for example, the top of the head and the bottom of the legs). To inject this into the model, we applied different per-joint weights in the loss function, which allows us to focus the skeleton model on the joints that matter most for the height measurement application.
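
In code, weighting joints amounts to a weighted average of the per-joint heat-map errors. This is a minimal sketch; the function name and the normalization by the weight sum are assumptions, and the weight values themselves are a tuning choice.

```python
import numpy as np

def weighted_joint_mse(pred, target, joint_weights):
    """Heat-map MSE where each joint's error is scaled by its weight.

    pred, target: (J, H, W) arrays; joint_weights: (J,) non-negative weights.
    """
    weights = np.asarray(joint_weights, dtype=float)
    per_joint = ((pred - target) ** 2).mean(axis=(1, 2))    # (J,)
    return float((weights * per_joint).sum() / weights.sum())
```

Raising the weight of a joint makes its localization error dominate the gradient, at the expense of the other joints.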

In the graph below you can find an example of the effect of joint weights on the model performance metric MeanSmS@0.1. MeanSmS@0.1 is the mean PCKh@0.1 calculated only on the joints relevant to the height measurement application.

MeanSmS@0.1 with different joints weights

Freeze Layers & Heat-map Size

Another common method for transfer learning is freezing layers of the pre-trained model. For initial training, we started with freezing all layers except for Stage4 of HRNet. Each HRNet stage is composed of a sequence of convolutions and fusing different resolutions. As the training process moved on we examined the option to freeze more layers (Stage4 sub-modules), to reduce overfit (due to small amounts of data). While attempting to reduce the overfit we encountered an under-fit issue.
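
A freezing policy can be expressed as a choice of which parameter names stay trainable. The helper below is a framework-agnostic sketch; the parameter names in the usage note are hypothetical, and the post does not specify whether the output layer was also kept trainable.

```python
def trainable_names(param_names, trainable_prefixes):
    """Return the parameter names that stay trainable; all others are frozen."""
    prefixes = tuple(trainable_prefixes)
    return [name for name in param_names if name.startswith(prefixes)]

# Assuming a PyTorch model, the selection could then be applied like this:
#   keep = set(trainable_names([n for n, _ in model.named_parameters()], ["stage4."]))
#   for name, param in model.named_parameters():
#       param.requires_grad = name in keep
```

For example, with names such as `stage3.0.fuse.weight` and `stage4.0.branches.0.weight`, passing `["stage4."]` keeps only the latter trainable. Freezing sub-modules of Stage4 is then a matter of passing narrower prefixes.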

We thought of a way to improve performance by enlarging the heat-map size. Key-points methods that are based on heat-maps (as in HRNet) might have performance limitations due to quantization in the heat-maps stage. In our case, the pre-trained model is trained on 64×64 heat-map size.
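
To get a feel for the quantization limit, consider decoding a joint as the integer arg-max cell of the heat-map. The small calculation below assumes a 256-pixel input side, which is an illustrative value and not stated above.

```python
def max_quantization_error(input_size, heatmap_size):
    """Worst-case per-axis error, in input pixels, of an integer arg-max decode."""
    stride = input_size / heatmap_size   # input pixels covered by one heat-map cell
    return stride / 2.0                  # the true peak is at most half a cell away

# Assumed 256-pixel input side:
err_64 = max_quantization_error(256, 64)   # 2.0 input pixels
err_96 = max_quantization_error(256, 96)   # about 1.33 input pixels
```

Under this assumption, moving from 64×64 to 96×96 heat-maps lowers the worst-case decoding error from 2 to about 1.33 input pixels per axis, which matters at a strict threshold like PCKh@0.1.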

We conducted two experiments, both without freezing any layers:

  1. Output of 96×96 sized heat-maps (pre-trained model is with 64×64)
  2. Output of 64×64 sized heat-maps (for comparison)

Mean@0.1 with different freeze layers

We found that performance improved as we unfroze more layers. We obtained the best results by unfreezing all layers, almost regardless of heat-map size. This might seem counter-intuitive given our small amount of data; we found a possible explanation in the literature, called Deep Double Descent.


Challenging the optimizer choice was our next task. Adaptive optimizers like Adam have become a default choice for training neural networks. When working with limited data, model generalization is a challenging task, and regularization is a common way to improve it. One option is weight decay, but in Adam, weight decay implemented as L2 regularization is not equivalent to true weight decay: the penalty term is rescaled by the adaptive per-parameter step size, so weights with large gradient magnitudes are regularized less than intended.

AdamW is an optimizer that decouples weight decay from the gradient update, which makes weight decay work properly with Adam and hence generalize better and yield better models. Although AdamW is widely used with transformers, it is less common in CNN-based networks like HRNet. Nonetheless, we found it useful, and it improved our results by ~0.3%.
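
The difference between the two forms of decay is easiest to see in a single update step. The sketch below follows the update rules from the Adam and AdamW papers; the function name and the `l2` / `decoupled_wd` parameter names are my own, and in practice one would use the optimizer shipped with the training framework.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam update with either L2 regularization or decoupled weight decay."""
    g = grad + l2 * w                      # L2 penalty enters through the gradient
    m = b1 * m + (1 - b1) * g              # first-moment estimate
    v = b2 * v + (1 - b2) * g * g          # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    adaptive = lr * m_hat / (np.sqrt(v_hat) + eps)
    decay = lr * decoupled_wd * w          # AdamW: decay bypasses adaptive scaling
    return w - adaptive - decay, m, v
```

With a zero data gradient, the L2 variant moves the weight by roughly `lr` whatever the value of `l2`, because the penalty is divided by its own magnitude. The decoupled variant shrinks the weight in proportion to `decoupled_wd`, as intended.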

Mean@0.1 Optimizer Comparison

Distribution Aware Representation Key-Points

A novel Distribution-Aware coordinate Representation of Key-Point (DARK) method serves as a model-agnostic plugin. DARK significantly improves the performance of a variety of state-of-the-art human pose estimation models that are based on heat-maps.

What does DARK do?

  • Generation of modulated heat-maps, assuming a gaussian distribution
  • Estimation of function for the values around extreme points, by Taylor Series
  • Re-localization of joint coordinates by calculating the extreme point
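
The last two steps can be sketched in a few lines of NumPy: take the log of the heat-map, estimate its gradient and Hessian at the arg-max with finite differences, and shift the peak by the Taylor-series solution. This is a simplified illustration, not the reference DARK implementation; the function name is my own and the heat-map modulation (smoothing) step is omitted.

```python
import numpy as np

def refine_peak(heatmap, eps=1e-10):
    """Sub-pixel peak (x, y) from a second-order Taylor expansion of log(heatmap)."""
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if not (1 <= x < w - 1 and 1 <= y < h - 1):
        return float(x), float(y)          # border peak: no neighbours to use
    logh = np.log(np.maximum(heatmap, eps))
    # Central finite differences for the gradient and Hessian at the arg-max.
    dx = 0.5 * (logh[y, x + 1] - logh[y, x - 1])
    dy = 0.5 * (logh[y + 1, x] - logh[y - 1, x])
    dxx = logh[y, x + 1] - 2 * logh[y, x] + logh[y, x - 1]
    dyy = logh[y + 1, x] - 2 * logh[y, x] + logh[y - 1, x]
    dxy = 0.25 * (logh[y + 1, x + 1] - logh[y + 1, x - 1]
                  - logh[y - 1, x + 1] + logh[y - 1, x - 1])
    hessian = np.array([[dxx, dxy], [dxy, dyy]])
    if abs(np.linalg.det(hessian)) < 1e-12:
        return float(x), float(y)          # degenerate curvature: keep the arg-max
    offset = -np.linalg.solve(hessian, np.array([dx, dy]))
    return float(x + offset[0]), float(y + offset[1])
```

For a heat-map that really is Gaussian, the log is exactly quadratic, so this recovers the true sub-pixel center rather than the nearest integer cell.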

In our case, using DARK yielded an improvement of ~0.9% in the Mean@0.1 evaluation metric.

DARK Improvement

Development Timeline

In this part, you can see all the experiments and data iterations on a single timeline. Putting it all together clearly shows the benefit of each algorithmic improvement and data addition.
We ran the tests throughout the development process, at each step, to make sure we stayed focused and kept working towards the project’s goals. We verified that improvements in the skeleton model (PCKh model metric) yielded improvements in height measurement (user experience metric). Indeed, the correlation between skeleton PCKh@0.1 performance (on the relevant joints, MeanSmS@0.1) and height accuracy was high, which gave us confidence that we were looking at the right metrics.

Development Timeline

The main challenge when working with limited data in an iterative process is overfitting to the training data. We measured overfitting as the difference between PCKh@0.1 on the training and validation sets. The main contributors to reducing it were our augmentation policy, TTA, and, of course, data addition.

Overfit Reduction Over Time


In this blog post, we went through defining the problem we wanted to solve, setting our goals, and taking confident baby steps towards the solution. We’ve shared how we dealt with real-world constraints such as cumbersome data collection and annotation. On the algorithmic side, we’ve shared, step by step, how we continued to improve model performance using methods ranging from basic to advanced.

The first main takeaway from this development is to find and define the right metrics from the very beginning: metrics that let you check the actual impact on the product/business KPIs (user experience, in this case), together with a way to translate them into model metrics. The second main takeaway is that you will never have enough data, so be creative with the data you have in order to deliver a quality model on time.



  1. HRNet Paper https://arxiv.org/pdf/1902.09212.pdf
  2. RandAugment paper https://arxiv.org/pdf/1909.13719.pdf
  3. Deep Double Descent paper https://arxiv.org/pdf/1912.02292.pdf
  4. Adam paper https://arxiv.org/pdf/1412.6980.pdf
  5. AdamW paper https://arxiv.org/pdf/1711.05101.pdf
  6. Distribution-Aware coordinate Representation of Key-Point paper https://arxiv.org/pdf/1910.06278.pdf


  1. MPII http://human-pose.mpi-inf.mpg.de/
  2. COCO https://cocodataset.org/#home
  3. OKS https://cocodataset.org/#keypoints-eval

Experiment management Platform by https://wandb.ai/

