
Figure 1: Example images from the PlantVillage dataset. [Image by author]

Uncovering Bias in the PlantVillage Dataset

A critical evaluation of the most famous plant disease detection dataset used for developing deep learning models

Plant diseases are responsible for 20 to 40% of global crop losses each year. Disease detection and identification play an essential role in disease management and in minimizing crop losses, and since diseases are detected primarily through visual inspection, deep learning is a natural fit for this problem.

The artificial intelligence revolution started in the early 2010s, when convolutional neural networks began dominating computer vision competitions. The real value of AI, however, became clear when it started tackling challenges in other domains such as medicine and physics. Today, machine learning is an indispensable tool in plant science, with wide-ranging applications such as classifying plant cell organelles, high-throughput root phenotyping, and estimating crop growth from drone images. Even though machine learning was used for plant disease identification as early as 2007 [1], the lack of large public datasets prevented further studies. This changed when the first extensive public plant disease dataset, PlantVillage, was published in 2015 [2].

The PlantVillage dataset is the largest and most studied plant disease dataset. It contains more than 54,000 images of leaves on a homogeneous background, organized into 38 classes corresponding to plant-disease pairs. Its publication sparked a plethora of studies on plant disease classification using deep learning, most of which reported classification accuracies above 98% [3].

Yet these models are tested on a subset of the PlantVillage dataset itself, so if the dataset has a bias problem, it won't be detected. While looking at images from this dataset, it seemed to me that the capture conditions differed between classes. To check whether I was imagining things, I ran a simple experiment.

The PlantVillage dataset contains 54,305 single-leaf images from 14 crop species (Figure 1). There are 38 classes, named species_disease or species_healthy. The leaves were removed from the plant, placed against a grey or black background, and photographed outdoors with a single digital camera on sunny or cloudy days.

I reduced each image in this dataset to 8 background pixels: four from the corners and four from the midpoints of the sides (Figure 2). To a classifier, these pixels should be pure noise. Figure 3 shows the example images from Figure 1 reduced to their 8 background pixels. I call this dataset PlantVillage_8px.

Figure 2: A) Location of the 8 pixels. B) 8 pixels close view [Image by author]
Figure 3: Example images from the PlantVillage_8px dataset [Image by author].
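The reduction step can be sketched as follows. `extract_8px` is a hypothetical helper (the author's exact code and pixel offsets are not given): it samples the four corner pixels and the four side midpoints of an image and flattens them into a 24-dimensional feature vector.

```python
import numpy as np
from PIL import Image

def extract_8px(path):
    """Reduce an image to its 8 outermost background pixels:
    the four corners plus the midpoints of the four sides.
    (A hypothetical helper, not the author's exact code.)"""
    img = np.asarray(Image.open(path).convert("RGB"))
    h, w = img.shape[:2]
    # Four corner pixels, then the midpoint of each side.
    rows = [0, 0, h - 1, h - 1, 0, h - 1, h // 2, h // 2]
    cols = [0, w - 1, 0, w - 1, w // 2, w // 2, 0, w - 1]
    return img[rows, cols].flatten()  # 8 RGB pixels -> 24 features
```

Applying this to every image yields a table of 54,305 rows and 24 columns, which is what the classifier below is trained on.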

In order to quantify the amount of bias in the PlantVillage dataset, I trained and tested a machine learning model on the PlantVillage_8px dataset. If there is no bias, the model should not be able to beat the random guess accuracy, which is 100/number_of_classes % for a balanced dataset. For the 38 classes here, that is around 2–3%.

If there is no bias, the model should not be able to beat the random guess accuracy of 2–3%.

I used scikit-learn’s random forest classifier with default hyperparameters to train the model. To be comparable with prior work on PlantVillage, the dataset was randomly split into a training set (80%) and a test set (20%). Classification accuracy was used to evaluate model performance.
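A minimal sketch of this setup is below. The arrays are random placeholders standing in for the real PlantVillage_8px features and labels (which would come from the pixel-extraction step); with the actual data, the trained model is compared against the ~2.6% chance baseline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 fake "images" of 8 RGB background pixels
# (24 features) with 38 class labels, standing in for PlantVillage_8px.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(1000, 24))
y = rng.integers(0, 38, size=1000)

# 80/20 random split, matching prior work on PlantVillage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier()  # default hyperparameters, no tuning
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))

# Random-guess baseline for a balanced 38-class problem.
baseline = 100 / 38  # ~2.6%
```

On this random placeholder data, accuracy stays near the chance baseline, which is exactly what an unbiased background should produce; the 49% result on the real data is the signature of the bias.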

The model achieved 49% accuracy using pure noise! This indicates significant dataset bias in the PlantVillage dataset. Since the foregrounds and the backgrounds are not correlated contextually, minimal background bias is expected. Therefore, capture bias must be the main reason for the dataset bias.

The model achieved 49% accuracy using pure noise!

This means that models developed on this dataset will experience significant performance drops even on new datasets collected under similar conditions, let alone field data. Note that this experiment underestimates the dataset bias, because capture bias influences both the background and the foreground, whereas the model used only a fraction of the background. Moreover, the random forest model was trained with default hyperparameters, without any tuning to improve its performance.

At the end of the day, the best way to deal with a biased dataset is to avoid collecting it in the first place. Design of Experiments, a branch of statistics, laid out the principles of efficient and proper data collection. The salient idea is to determine the noise factors before data collection and ensure they are either controlled for or randomized. If one must work with a biased dataset, the first step is to understand the bias sources and quantify them. Once this is done, bias can be decreased by either removing it or negating it with additional data collection. The most critical step is to collect a separate dataset that matches the use case and report the model performance on this dataset. This will provide a reliable estimate of the model performance.

This experiment identified and quantified the dataset bias in the PlantVillage dataset. As data scientists, we are responsible for creating reliable models, not just reporting a seemingly high accuracy on a biased test set. We should be diligent when using this and similar datasets to develop machine learning models.


[1] K. Huang, Application of artificial neural network for detecting Phalaenopsis seedling diseases using color and texture features (2007), Computers and Electronics in Agriculture

[2] D. P. Hughes and M. Salathe, An open access repository of images on plant health to enable the development of mobile disease diagnostics (2015), arXiv

[3] K.P. Ferentinos, Deep learning models for plant disease detection and diagnosis (2018), Computers and Electronics in Agriculture

