A critical analysis of your dataset




Photo by Brandon Lopez on Unsplash

There are more and more applications of artificial intelligence and increasingly sophisticated models. At the same time, AI is hungry for data, and data collection and cleaning are still largely manual processes.

This article brings together considerations and suggestions on why proper dataset collection, cleaning, and evaluation are critical.

Data is the new oil. But like oil, data also needs to be refined

Image from Zbynek Burival

The application of artificial intelligence in both medicine and business has exploded in recent years. In general, more and more companies are investing in artificial intelligence or using machine learning models to analyze their data. At the same time, it is increasingly easy to integrate or train AI models: alongside standard libraries such as PyTorch, TensorFlow, and scikit-learn, several AutoML solutions are emerging.

However, 90 percent of companies consider data to be the biggest obstacle to developing an AI strategy for their business. In fact, companies are often unable to determine in advance how much data is needed, how to integrate multiple sources, how to ensure data quality, or which regulations apply. The result is increased costs, missed deadlines, and problems with regulatory agencies. In addition, a survey conducted by Anaconda shows that more than 60 percent of a data scientist's time is spent on operations related to data management (data loading, cleansing, and visualization).

There are fields where acquiring and curating data is quite expensive. For example, in the biomedical field, it is expensive both to obtain patient data (informed consent, permits, cost of samples) and to label the samples (experts and clinicians are needed).

“Torture the data, and it will confess to anything.” — Ronald Coase

In general, the choices made during data acquisition and processing affect the final outcome of the model (both its reliability and its ability to generalize). For example, when melanoma recognition models were tested on dark skin tones, the area under the curve (AUC) decreased by 10–15%. In fact, there were few examples of dark skin tones in the training set, and the dermatologists themselves made more annotation mistakes on those images.

F1-score and ROC-AUC performance of three dermatologic algorithms applied to the Diverse Dermatology Images (DDI) dataset, across skin color (all, FST I-II, and FST V-VI) and disease rarity (all diseases, DDI; only common diseases, DDI-C). Image from the original article: here

The authors note that by improving the annotations and adding more examples, they were able to achieve better classification of dark skin tone images.

Data-centric versus model-centric view

Equilibrium between model-centric and data-centric. Image by Artem Kniaz on Unsplash.com

Andrew Ng has stated on YouTube that 99% of articles are model-centric and only the remaining 1% are data-centric. But what do model-centric and data-centric mean?

  • Model-centric AI. The dataset is considered fixed, and the researchers’ focus is on optimizing the model architecture so as to achieve the best result in terms of accuracy.
  • Data-centric AI. The focus is instead on methods to improve the data pipeline (selection, annotation, cleaning, and so on).

“Man is what he eats.” — Ludwig Feuerbach. Is it the same for AI models? Are they the data they devour?

Most articles focus on improving a model (changes in architecture or training) and evaluating it on standard benchmark datasets. These datasets are not error-free and should not be used without critical analysis. In the data-centric approach, the dataset itself is also under the researcher’s eye and can be modified.

In addition, the model-centric approach has enabled the exponential improvement of artificial intelligence models, but today the gains in accuracy are often vanishingly small. Therefore, we need new datasets, but also an approach that re-evaluates and improves existing ones.

Most articles use the same standard datasets over and over. The figure shows the main datasets and their percentage use across articles. Image source: here

For example, more attention should be paid to data quality during collection, and datasets should be enriched with metadata. In fact, 90% of articles on AI in dermatology present no information on skin tones. In the next sections, we discuss how to improve the critical points of a dataset.

An intelligent design for data collection

How to create an intelligent data pipeline. The Creation of Adam by Michelangelo (image source: here)

“Data is like garbage. You’d better know what you are going to do with it before you collect it.” — Mark Twain

When you want to design a new AI application, you have to keep the task in mind. Choosing the model is critical, but so is choosing the data source. Most often, the dataset is downloaded and, once processed, remains fixed. Instead, we should follow a dynamic approach, in which an initial dataset is collected and preliminary analyses are done to check for bias.

In addition, we need to be sure that our sample is representative of the population. A classic pitfall is Simpson’s paradox (a result or trend that is visible in the entire dataset disappears or reverses when the data are divided into groups).

A visual representation of Simpson’s paradox. Using a regression model on all the data, there seems to be an obvious trend, which, however, disappears when the groups are considered separately. Image source: here
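A minimal sketch of the paradox on synthetic data, using NumPy and scikit-learn (the groups and numbers are invented purely for illustration): the pooled regression slope is positive even though the slope within each group is negative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two synthetic groups: within each group, y decreases as x increases,
# but the groups are offset so the pooled trend looks positive.
x1 = rng.uniform(0, 5, 200)
y1 = 2 - 0.8 * x1 + rng.normal(0, 0.3, 200)
x2 = rng.uniform(5, 10, 200)
y2 = 8 - 0.8 * x2 + rng.normal(0, 0.3, 200)

x_all = np.concatenate([x1, x2]).reshape(-1, 1)
y_all = np.concatenate([y1, y2])

pooled = LinearRegression().fit(x_all, y_all)
group1 = LinearRegression().fit(x1.reshape(-1, 1), y1)
group2 = LinearRegression().fit(x2.reshape(-1, 1), y2)

print(f"pooled slope:  {pooled.coef_[0]:+.2f}")   # positive trend
print(f"group 1 slope: {group1.coef_[0]:+.2f}")   # negative trend
print(f"group 2 slope: {group2.coef_[0]:+.2f}")   # negative trend
```

If a model is trained on the pooled data without the group variable, it learns the opposite of the within-group relationship, which is exactly why checking subgroup structure during collection matters.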

This aspect is critical when collecting a dataset if we want the model to generalize. Unfortunately, most datasets are biased and collected in only a few countries (without adequate representation of minorities and other regions). This is especially problematic when developing algorithms for medical applications.

Geographic distribution of countries in the Open Images dataset. Image source: here

There are approaches that have been designed to remedy this:

  • Improving data coverage. Involve communities and pay attention to inclusiveness. For example, the BigScience project made inclusion one of its principles from the design stage. There are also projects such as Common Voice, which collects speech transcriptions in 76 languages from more than 100,000 participants, allowing the inclusion of languages that are usually overlooked.
  • Synthetic data. While collecting medical data or images of human faces is costly and potentially risky for privacy, synthetic data both lowers costs and preserves privacy. Synthetic data seems promising for medicine, robotics, computer vision, and so on.
Real and synthetic samples of head MRI. Image source: here

The good news is that several companies have also made efforts to improve inclusion (for example, Meta with No Language Left Behind). Nonetheless, these efforts are still in their early stages: a great many projects reuse the same benchmark datasets, and the use of synthetic data is still suboptimal (performance lags, and synthetic data may itself be biased).

Therefore, when collecting a dataset, it is critical to attach metadata that records statistics about sex, gender, ethnicity, and geographic location. Similarly, both scientific journals and conferences should require this metadata.

Sculpting the data into a masterpiece

David by Michelangelo. Image source (here)

“Data that is loved tends to survive.” — Kurt Bollacker

AI models have shown outstanding performance in several fields, but they can overfit training-set biases and label noise. During data collection, annotation and labeling are considered a bottleneck: they are not only expensive but also error-prone. To lower costs, companies rely on crowdsourcing platforms such as Amazon Mechanical Turk, but the results are not always of high quality. On the other hand, medical or LIDAR images require annotation by experts (which is even more expensive).

Several solutions are being studied to speed up the annotation process. For example, users can provide functions (or general rules) to annotate data, and an algorithm aggregates these initial labels. Alternatively, an algorithm selects the most informative data points (greater information gain, greater uncertainty) and a person annotates them (human-in-the-loop).
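A minimal sketch of this programmatic-labeling idea (in the spirit of tools such as Snorkel, but written from scratch here; the spam/ham task and the labeling functions are hypothetical): each function encodes a heuristic and may abstain, and the votes are aggregated by a simple majority.

```python
import numpy as np

ABSTAIN, HAM, SPAM = -1, 0, 1

# Hypothetical labeling functions: each encodes a simple heuristic
# and may abstain when it has no opinion.
def lf_contains_offer(text):
    return SPAM if "free offer" in text.lower() else ABSTAIN

def lf_has_url(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_short_personal(text):
    return HAM if len(text.split()) < 6 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_offer, lf_has_url, lf_short_personal]

def weak_label(texts):
    """Aggregate labeling-function votes by majority, ignoring abstentions."""
    labels = []
    for text in texts:
        votes = [lf(text) for lf in LABELING_FUNCTIONS]
        votes = [v for v in votes if v != ABSTAIN]
        if not votes:
            labels.append(ABSTAIN)   # no function fired: leave unlabeled
        else:
            labels.append(int(np.bincount(votes).argmax()))
    return labels

docs = ["Claim your free offer now at https://example.com",
        "See you at lunch?",
        "Quarterly financial report attached for your review"]
print(weak_label(docs))   # [1, 0, -1]
```

The noisy labels produced this way are then used to train a model, or handed to annotators who only need to review the uncertain or unlabeled cases.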

In other cases, the problem is data scarcity. In computer vision, image augmentation techniques (rotation, scaling, flipping, and so on) are often used to increase the number of examples in a dataset. Libraries also exist today to augment tabular and even text data.

Image augmentation. Image source: here
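As a rough sketch, an augmentation pipeline with torchvision might look like the following (the file name is a placeholder, and the specific transforms and parameters are just one example among many):

```python
from torchvision import transforms
from PIL import Image

# A typical augmentation pipeline: each call produces a slightly
# different version of the same image, artificially enlarging the dataset.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

image = Image.open("cat.jpg").convert("RGB")   # hypothetical input file
samples = [augment(image) for _ in range(8)]   # eight augmented variants
print(samples[0].shape)                        # torch.Size([3, 224, 224])
```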

On the other hand, not even benchmark datasets are clean and error-free. Several studies have shown that they contain numerous errors (misannotations, incorrect labels, and so on). For example, the validation set of ImageNet (one of the most popular datasets for image classification) contains at least 6% incorrect labels. Thus, even once a dataset is collected, the work is not finished: it should be checked dynamically.

Examples of errors in standard benchmark datasets. Image source: here
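One simple way to surface candidate label errors, sketched here from scratch with scikit-learn (dedicated libraries such as cleanlab implement this idea far more rigorously): score each sample with out-of-fold predicted probabilities and flag samples whose given label receives unusually low confidence. The 0.5 factor on the threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)

# Out-of-fold predicted probabilities, so each sample is scored by a
# model that never saw it during training.
probs = cross_val_predict(
    LogisticRegression(max_iter=2000), X, y, cv=5, method="predict_proba"
)

# Flag samples whose given label receives a probability well below the
# average self-confidence of that class.
self_conf = probs[np.arange(len(y)), y]
thresholds = np.array([self_conf[y == c].mean() for c in range(probs.shape[1])])
suspect = np.where(self_conf < 0.5 * thresholds[y])[0]

print(f"{len(suspect)} potentially mislabeled samples to review")
```

The flagged samples are not automatically wrong; they are a short list for a human reviewer to inspect, which is much cheaper than re-checking the whole dataset.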

One can try different regularization or reweighting techniques. Unfortunately, these are often expensive solutions. Another approach is to use the data Shapley score, filter out data points of poor quality, and retrain the model on the cleaned dataset. This approach also has the advantage of letting us analyze the model’s behavior in the presence or absence of certain data (and thus also assess certain biases).

Data Shapley quantifies how important each data point is and how much model performance changes if that data point is removed. Image source: here
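Computing the exact data Shapley value requires many Monte Carlo permutations; the sketch below uses a much cruder leave-one-out approximation on synthetic data, purely to convey the intuition (the value of a point is the drop in validation accuracy when it is removed).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

def score(train_idx):
    model = LogisticRegression(max_iter=1000).fit(X_tr[train_idx], y_tr[train_idx])
    return model.score(X_val, y_val)

full_idx = np.arange(len(X_tr))
baseline = score(full_idx)

# Value of each training point = drop in validation accuracy when it is removed.
values = np.array([baseline - score(np.delete(full_idx, i)) for i in full_idx])

# Points with the most negative value hurt the model: candidates for removal.
worst = np.argsort(values)[:10]
cleaned_idx = np.delete(full_idx, worst)
print(f"baseline: {baseline:.3f}, after cleaning: {score(cleaned_idx):.3f}")
```

On real datasets the retraining loop becomes expensive, which is exactly why the approximation schemes proposed in the data Shapley literature matter.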

No easy exam

Image from Green Chameleon at unsplash.com

“We are surrounded by data, but starved for insights.” — Jay Baer

After several adjustments to the model, having tried different architectures and so on, the coveted evaluation time finally comes. The standard paradigm is the division of the dataset into training, validation, and test sets. The test set must be set aside so as to avoid potential data leakage. Is that all we need to know?

Actually, one would first need to check that the test set is representative (containing enough examples of the various classes, for example). Even this is sometimes not enough.
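A quick, minimal check along these lines, using synthetic data and scikit-learn: split with stratification and compare class proportions between the training and test sets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=0)

# A stratified split keeps class proportions similar in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

def proportions(labels):
    return np.bincount(labels) / len(labels)

print("train:", np.round(proportions(y_tr), 3))
print("test: ", np.round(proportions(y_te), 3))
# A large gap between the two rows is a first warning that the test set
# is not representative of the data the model was trained on.
```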

The purpose of the test set is to test the model’s ability to generalize. For example, it was shown that a model trained on x-rays from one hospital was unable to generalize when tested on images collected from other hospitals. Therefore, a test set should be carefully designed: images should not all come from the same source, and ideally they should be annotated by a different expert.

The AI model fails to generalize: the model finds a correlation between the image and the hospital where it was acquired. Image source: here. License: here

For example, it has been noted that models with seemingly exceptional accuracy can completely miss predictions in response to small changes (change of background, different context for the same object). In fact, many AI models have been observed to implement so-called “shortcuts”, or heuristic strategies. A classic example: if in the training set all examples of cats are on a couch, the model might associate the couch with the label “cat” and be unable to recognize a cat in a different context.

Example of shortcuts used by neural networks. Image source: here

These spurious correlations may also be present in medical applications (the model might recognize something associated with a particular hospital and then fail on images of patients from other hospitals). Moreover, the phenomenon is not restricted to computer vision: shortcuts can also occur in Natural Language Processing (NLP) models. In fact, a model often learns associations between the first and second parts of a sentence that are actually just spurious correlations.

Data ablation studies have been suggested as a potential remedy to understand what shortcuts the model uses and to correct this behavior.

Example of a data ablation study for vision transformers. In this example, the training set contains seagulls (water birds, always with a water background) and land birds with a forest background; if the test set contains a seagull in a forest, it would probably be classified as a land bird. Here, patches are removed to study the robustness of the model. Image source: here
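A toy sketch of the patch-ablation idea in PyTorch (the model here is an untrained placeholder and the input is random noise, purely to show the mechanics): occlude one patch at a time and record which occlusions flip the prediction.

```python
import torch
import torch.nn as nn

# Placeholder model (untrained); in practice, substitute your trained classifier.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2)
)
model.eval()

def patch_ablation(image, patch=56):
    """Zero out one patch at a time and report where the prediction changes."""
    with torch.no_grad():
        base_pred = model(image.unsqueeze(0)).argmax(1).item()
        flipped = []
        _, h, w = image.shape
        for top in range(0, h, patch):
            for left in range(0, w, patch):
                ablated = image.clone()
                ablated[:, top:top + patch, left:left + patch] = 0.0
                pred = model(ablated.unsqueeze(0)).argmax(1).item()
                if pred != base_pred:
                    flipped.append((top, left))
    return base_pred, flipped

image = torch.rand(3, 224, 224)   # stand-in for a real input image
base, sensitive_patches = patch_ablation(image)
# If occluding a background patch flips the label, the model may be relying
# on a shortcut (e.g., the background) rather than on the object itself.
print(base, sensitive_patches)
```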

In addition, the evaluation of a model is typically reduced to a single number (an evaluation metric such as accuracy, AUC, and so on). This can often be misleading: a typical example is an imbalanced dataset, where the model can achieve high accuracy simply by predicting the most abundant class. Moreover, although the overall accuracy might be very good, the model might make systematic errors on specific subgroups of the data. These errors can be harmful, as in the case of minorities, gender, and geographic origin.

For example, facial recognition algorithms have been shown to classify minorities less accurately, with the risk of harmful bias. Several studies have addressed how to mitigate bias and increase the fairness of algorithms. Multiaccuracy is a framework developed to ensure accurate predictions across identifiable subgroups. However, metadata are not always available, so an algorithm (DOMINO) was developed that identifies clusters in the test set alone where the model is at risk of error.
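A minimal sketch of subgroup evaluation with pandas and scikit-learn (the labels, predictions, and the skin_tone metadata column are invented for illustration): the overall accuracy hides a large gap between the two groups.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical evaluation results: true labels, predictions, and a
# metadata column describing the subgroup each sample belongs to.
results = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    "y_pred": [1, 0, 1, 0, 0, 1, 1, 0, 0, 0],
    "skin_tone": ["I-II", "I-II", "I-II", "I-II", "I-II",
                  "V-VI", "V-VI", "V-VI", "V-VI", "V-VI"],
})

# A single global number can hide systematic errors in a subgroup.
print("overall accuracy:", accuracy_score(results.y_true, results.y_pred))

for group, chunk in results.groupby("skin_tone"):
    acc = accuracy_score(chunk.y_true, chunk.y_pred)
    f1 = f1_score(chunk.y_true, chunk.y_pred, zero_division=0)
    print(f"{group}: accuracy={acc:.2f}, f1={f1:.2f}")
```

Here the overall accuracy is 0.60, but it decomposes into 0.80 for one group and 0.40 for the other, which is exactly the kind of disparity a single aggregate metric conceals.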

In addition, there are cases where data collection is dynamic, and even the AI task itself can change (domain shift). For example, an autonomous vehicle needs to recognize new types of vehicles or a new type of traffic signal. Retraining the model is expensive, and updating it also raises new questions about how to evaluate it. This is why MLOps is becoming one of the fastest-growing fields: libraries such as TFX and MLflow include features for cases like these and make it possible to analyze the dataset with agility.
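As a very rough sketch of this kind of monitoring (a synthetic one-dimensional feature and SciPy's two-sample Kolmogorov-Smirnov test; real MLOps tooling offers much richer checks): compare the distribution of a feature at training time with what the model sees in production.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Reference feature values seen at training time vs. new production data
# that has slowly drifted (hypothetical one-dimensional feature).
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.4, scale=1.2, size=1000)

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Drift detected (KS={stat:.3f}, p={p_value:.1e}): "
          "consider re-labeling recent data and retraining.")
else:
    print("No significant drift detected.")
```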

Parting thoughts

Photo by Brett Jordan at unsplash.com

Data quality is key to the success of a good artificial intelligence application. The choice of how to collect and manipulate data is crucial, although in model-centric AI data scientists often focus only on how to improve the model.

How data are collected and processed also opens up ethical issues and can lead to potential bias. Moreover, these choices are not only important but increasingly relevant today (as I discussed in a previous article, institutions are regulating artificial intelligence). In fact, studies have shown that even standard datasets are not free of errors and biases (ImageNet also contains potentially offensive labels).

In addition, the evaluation of a model can present critical issues. As we have seen, the model may be less accurate with some groups or exploit shortcuts. Although we have seen several possible technical solutions, many of these problems could be solved with better data collection and curation.

Because particular groups and categories (minorities, languages other than English, other countries) are often underrepresented, many targeted projects have arisen. For example, the COCO dataset is one of the best known and most used for segmentation tasks, but it presents typical Western objects and scenes. A few years ago, COCO Africa was established, showing scenes and objects that can be encountered in Africa. Similarly, there are many other projects dedicated to languages that are scarcely represented in datasets.

COCO Africa (image source: GitHub official repository)

The model-centric approach has dominated AI over the last decade, but the data-centric approach is now growing. In fact, explainable AI is becoming more and more important, and the data-centric view is part of it. In addition, one must always maintain a critical attitude toward one’s data (whether we collected it ourselves or it is a benchmark dataset) because potential errors and biases may have gone unnoticed.

If you have found it interesting:

You can look for my other articles, you can also subscribe to get notified when I publish articles, and you can also connect or reach me on LinkedIn. Thanks for your support!

Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.

Or feel free to check out some of my other articles on Medium:

