The Unlabeled Data Science


Literally, Predict Something from (Almost) Nothing

Labeled and Unlabeled Data. Source: Kaggle

The Origin of Data

In data science, you often need tons of data to train your machine learning model. This data can originate from various sources such as text, images, or even videos. Websites such as Kaggle and arXiv already provide a vast amount of publicly shared datasets that can be fed into a model with minimal preprocessing. Unfortunately, this is often not the case in the real world.

Which animal does this fur belong to? Source: Dreams Time

Data can be classified into two categories based on its condition: labeled and unlabeled. As the name suggests, labeled data refers to data that has already been classified or described. For example, we might have a dataset of dogs where each sample is labeled with its breed, such as dalmatian, poodle, or bulldog. Unlabeled data, on the other hand, only shows the properties of a sample without telling us which class or value it belongs to. Take the example “an animal with fur”: it could be a dog, a cat, or even a bear, and we simply do not know which one it is.
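To make the distinction concrete, here is a toy sketch in Python. The feature values and class names are made up for illustration; marking unlabeled samples with -1 follows scikit-learn's semi-supervised convention:

```python
# Toy illustration of labeled vs. unlabeled data (made-up feature values).
# Following scikit-learn's semi-supervised convention, -1 marks an unlabeled sample.
import numpy as np

features = np.array([[4.0, 1.2],   # e.g. fur length, ear size (hypothetical units)
                     [2.1, 0.8],
                     [3.5, 1.0],
                     [2.9, 0.9]])
labels = np.array([0, 1, -1, -1])  # 0 = dog, 1 = cat, -1 = unknown (unlabeled)

labeled_mask = labels != -1
print("labeled samples:", labeled_mask.sum())      # 2
print("unlabeled samples:", (~labeled_mask).sum()) # 2
```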

How the Machines Learn

Supervised and Unsupervised Learning. Source: Twitter

Machine learning provides options to build a model based on the condition of our data. If the data is labeled, we can use supervised learning; if it is unlabeled, we can use unsupervised learning. Unfortunately, there is a downside to choosing between these two methods. What if we want to achieve a goal that can only be accomplished with a supervised learning method, such as classification, but all we have is unlabeled data? We could, of course, label the data manually, but is there a better way?

One of the most common scenarios for a data scientist is to have a lot of data but very limited time to label it. The labeling process is a real burden: it consumes a lot of resources, especially when the data volume is extremely large. In this situation we inevitably ask ourselves: what do we do when we have only a small amount of labeled data but a large amount of unlabeled data? The answer is semi-supervised learning.

Semi-supervised Learning

Semi-supervised learning (SSL) is a machine learning technique that trains a model using a hybrid approach: a small amount of labeled data is combined with a much larger amount of unlabeled data. Because it blends supervised and unsupervised learning, it is a very interesting topic to explore. Interest in semi-supervised learning has grown in recent years, driven by the many application domains in which unlabeled data are plentiful, such as images, text, and bioinformatics.

Text Document Classification. Source: TowardsDataScience

A text document classifier is one of the most frequent examples of a semi-supervised learning application. Large collections of labeled text documents are hard to come by, because having someone read through full documents merely to assign a simple class is an inefficient task. Semi-supervised learning is therefore a good fit for this job: it enables the algorithm to learn from a small number of labeled text documents while still making use of the large number of unlabeled documents in the training set.
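As a minimal sketch of this idea, scikit-learn ships a SelfTrainingClassifier that wraps any classifier exposing predict_proba. The tiny corpus, the two classes, and the confidence threshold below are invented for illustration:

```python
# A minimal sketch of a semi-supervised text classifier using scikit-learn's
# SelfTrainingClassifier. The corpus, class names, and threshold are invented
# for illustration; -1 marks the unlabeled documents.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

docs = [
    "the team won the football match",      # labeled: sports (0)
    "he scored a goal in the final game",   # labeled: sports (0)
    "bake the cake in a hot oven",          # labeled: cooking (1)
    "stir the sauce and add some salt",     # labeled: cooking (1)
    "the coach praised the football team",  # unlabeled
    "add salt and stir the soup",           # unlabeled
]
labels = np.array([0, 0, 1, 1, -1, -1])

X = TfidfVectorizer().fit_transform(docs)

# The wrapped estimator only needs predict_proba; the confidence threshold
# controls which pseudo-labels get accepted into the labeled pool.
clf = SelfTrainingClassifier(LogisticRegression(), threshold=0.6)
clf.fit(X, labels)

preds = clf.predict(X)
```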

Behind the Scene

Semi-supervised learning process. Source: Research Gate

The main method semi-supervised learning uses to train on a minimal amount of labeled data relative to the unlabeled data is pseudo-labeling, which can be combined with various neural network models. Here is how it works:

  • Train a model on the labeled data, just as in supervised learning
  • Use the trained model to predict labels for the unlabeled training data, producing pseudo-labels; these predictions may not be fully accurate
  • Combine the labels and inputs from the labeled data with the pseudo-labels from the predicted unlabeled data
  • Repeat the training process to minimize the error and improve the model’s performance
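The four steps above can be sketched as a small self-training loop. The synthetic data, the 0.95 confidence threshold, and the three rounds below are illustrative assumptions, not prescribed values:

```python
# A minimal sketch of the pseudo-labeling loop described above, on synthetic
# data. The 0.95 confidence threshold and three rounds are illustrative
# choices, not prescribed values.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Pretend only 10 samples per class are labeled; the rest are "unlabeled".
labeled_idx = np.concatenate([np.where(y == 0)[0][:10], np.where(y == 1)[0][:10]])
mask = np.zeros(len(y), dtype=bool)
mask[labeled_idx] = True
X_lab, y_lab = X[mask], y[mask]
X_unl = X[~mask]

model = LogisticRegression()
for _ in range(3):                          # Step 4: repeat to refine the model
    model.fit(X_lab, y_lab)                 # Step 1: train on the labeled pool
    proba = model.predict_proba(X_unl)      # Step 2: predict the unlabeled data
    confident = proba.max(axis=1) > 0.95    # keep only high-confidence pseudo-labels
    if not confident.any():
        break
    pseudo = proba[confident].argmax(axis=1)
    X_lab = np.vstack([X_lab, X_unl[confident]])       # Step 3: merge pseudo-labels
    y_lab = np.concatenate([y_lab, pseudo])
    X_unl = X_unl[~confident]

accuracy = model.score(X, y)
```

Keeping only high-confidence predictions is what limits the damage from the inaccurate pseudo-labels mentioned in step two.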

How Well Does It Perform?

One may ask: how does the performance of semi-supervised learning compare to that of supervised learning?

Semi-supervised and Supervised Learning Performance Comparison. Source: Towards Data Science

When only a small amount of labeled data is available, semi-supervised learning does tend to outperform supervised learning. At some point, though, once a large amount of labeled data is available, supervised learning performance pulls clearly ahead.
