Labels — The Cornerstone of Modern Artificial Intelligence


Through the buzzword jungle

Artificial Intelligence (AI), Machine Learning (ML), Data Science, Big Data, Deep Learning, … That is quite a list of terms! In fact, with so many of them, setting out to find the most beneficial data-driven use cases in a company can feel like being trapped in a jungle of buzzwords.

A helpful rule of thumb for finding your way through this jungle is that Supervised Learning (SL) is an integral part of most modern AI applications. SL is a subset of AI in which models, such as deep neural networks, learn from examples. In this blog post, we will look at how SL works to better understand why creating such learning examples is one of the key tasks in implementing modern AI.

Automated pattern recognition

For starters, let’s think about how we could implement a sample use case, such as a risk-analysis model for an insurance company. Using regular programming techniques, we could codify some domain knowledge, for example that people driving sports cars have a higher risk of getting into an accident than people driving family cars.
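To make the "regular programming" approach concrete, here is a minimal sketch of such a hand-written rule. The table `RISK_BY_CAR_TYPE`, the function `rule_based_risk` and the risk classes are hypothetical names chosen for illustration; a real insurer would encode far more domain knowledge than this.

```python
# Hypothetical rule table: car types and risk classes are illustrative only.
RISK_BY_CAR_TYPE = {
    "sports": "high",
    "family": "low",
}


def rule_based_risk(car_type: str) -> str:
    """Return the hand-coded risk class for a car type.

    Car types that no expert has written a rule for fall back to "unknown".
    """
    return RISK_BY_CAR_TYPE.get(car_type, "unknown")


print(rule_based_risk("sports"))
print(rule_based_risk("pickup"))
```

Note that every rule has to be written and maintained by hand: the program knows nothing about car types that nobody thought of in advance.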

A different approach would be to collect historical data on risk and run some statistical comparisons, e.g. comparing the risk across different car types. This is a manual data analysis that searches for patterns. SL aims to automate such an analysis by applying a model to the historical data that recognizes the patterns inherent in it.
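The manual comparison described above could look roughly like the following sketch. The records in `HISTORY` are made up purely for illustration, and `accident_rate_by_car_type` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Made-up historical records: (car_type, had_accident).
HISTORY = [
    ("sports", True), ("sports", True), ("sports", False),
    ("family", False), ("family", False), ("family", True), ("family", False),
]


def accident_rate_by_car_type(records):
    """Compute the share of records with an accident for each car type."""
    counts = defaultdict(lambda: [0, 0])  # car_type -> [accidents, total]
    for car_type, had_accident in records:
        counts[car_type][0] += int(had_accident)
        counts[car_type][1] += 1
    return {car: accidents / total for car, (accidents, total) in counts.items()}


rates = accident_rate_by_car_type(HISTORY)
for car_type, rate in rates.items():
    print(f"{car_type}: {rate:.2f}")
```

Here the analyst still decides which column to group by. An SL model is meant to find such groupings in the data on its own.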

Supervised LEARNING

The decisive advantage of SL is that such pattern recognition is automated. However, this advantage comes at the cost of needing labeled records. A label classifies a given data record, e.g. it states that the client “John Doe” is a high-risk client. Without such labeled records, no SL algorithm can learn to recognize patterns.
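As a rough illustration of what a labeled record is and how a model learns from it, consider the sketch below. The records, the label names and the deliberately simple "majority label per car type" learner are all assumptions made for this example; it is a stand-in for a real SL algorithm, not a recommendation.

```python
from collections import Counter, defaultdict

# Hypothetical labeled records: each one pairs features with a label.
LABELED = [
    ({"name": "John Doe", "car_type": "sports"}, "high_risk"),
    ({"name": "Jane Roe", "car_type": "sports"}, "high_risk"),
    ({"name": "Max Moe", "car_type": "sports"}, "low_risk"),
    ({"name": "Ann Poe", "car_type": "family"}, "low_risk"),
    ({"name": "Bob Loe", "car_type": "family"}, "low_risk"),
    ({"name": "Eve Noe", "car_type": "family"}, "high_risk"),
]


def fit_majority_label(records):
    """Learn the most frequent label for each car type from labeled records."""
    votes = defaultdict(Counter)
    for features, label in records:
        votes[features["car_type"]][label] += 1
    return {car: counter.most_common(1)[0][0] for car, counter in votes.items()}


model = fit_majority_label(LABELED)
print(model)

# With no labeled records there is nothing to learn from: the model stays empty.
print(fit_majority_label([]))
```

The second call shows the point of this section in miniature: the learning step is automatic, but it produces nothing useful unless someone has labeled the records first.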

This would not be too much of an obstacle if the learning models were as data-efficient as we humans are. Sadly, these SL models are not. To correctly recognize a pattern and later apply it with great accuracy, models like deep neural networks often need more than 10,000 labels.

In fact, deep neural networks excel in modern AI applications because they are capable of learning from massive data sets containing millions of labeled records. For instance, if a deep neural net first learns from 100,000 labeled records, it may achieve an accuracy of 95%. If it is trained again at some later point on 500,000 records, it could achieve an accuracy of nearly 99%, without changing a single line of code. Without any labeled records, such powerful approaches are worthless.
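The idea that the same code can reach a different accuracy simply by seeing more labeled records can be tried out on a toy problem. The sketch below is not a deep neural network: it uses a one-nearest-neighbor learner on synthetic data, and the threshold of 0.6, the sample sizes and all function names are assumptions made for illustration. The accuracies it prints say nothing about real-world models.

```python
import random


def make_records(n, rng):
    """Create n synthetic labeled records: feature x, label True when x > 0.6."""
    return [(x, x > 0.6) for x in (rng.random() for _ in range(n))]


def predict_1nn(train, x):
    """Predict the label of the training record whose feature is closest to x."""
    return min(train, key=lambda record: abs(record[0] - x))[1]


def accuracy(train, test):
    """Share of test records whose label is predicted correctly."""
    hits = sum(predict_1nn(train, x) == label for x, label in test)
    return hits / len(test)


rng = random.Random(0)  # fixed seed so the run is repeatable
test_set = make_records(500, rng)

# Same learner, same code, only the number of labeled training records changes.
curve = {n: accuracy(make_records(n, rng), test_set) for n in (5, 50, 500)}
for n, acc in curve.items():
    print(f"{n} labeled records: accuracy {acc:.3f}")
```

Only the argument to `make_records` changes between runs, which mirrors the "without changing a single line of code" point above.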

A little metaphor

Forget about SL for a second. Instead, think of a group of students who are preparing for an exam. They have created a batch of index cards, which they now use to study.

Learning by index cards is quite similar to Supervised Learning.

Some students try to understand only the key information on 20% of the index cards, which earns them 80% of the exam points. Other students are willing to study all night because they are so hungry for information.

It is much the same with SL. Some algorithms are designed to be extremely data-efficient. Other models work well on massive amounts of data. What they all have in common is that they need a good training base.


Labeled data is critical to SL. Without training samples, no SL model can learn. The amount of prepared training data can determine how well a model will perform on a given use case. Therefore, labeling data is a necessary first step in every SL project.

