
Build a Data-Annotation Pipeline with Less than Fifty Lines of Code with Streamlit

Introduction

Most machine learning and deep learning tutorials you see online use well-known, curated datasets that are publicly available. A popular example is the Iris flower dataset, made famous by the British statistician and biologist Ronald Fisher, whose 1936 paper used four measured features for each of 150 flowers. Each flower in the dataset belongs to one of three species, so every measurement comes with a label, and a supervised learning algorithm can learn to distinguish the different classes.

For real-world problems, we often cannot rely on existing datasets and need to create our own before we can build models. This can be a cumbersome endeavor, as it was for Ronald Fisher: a lot of samples have to be collected and curated manually. Sometimes you have already collected raw data and want to do supervised learning but have not categorized the data yet. This means going through everything and assigning labels with human input.

Today, several annotation tools exist that can be used for this. Some are highly automated, such as Amazon SageMaker Ground Truth, which can suggest labels automatically when you upload data. Nonetheless, such tools might not be tailored to the task at hand. When prototyping and working towards a minimum viable product, you want a tool that lets you churn through your data and assign your labels at a high pace.

In this article, we will show how to quickly build such a tool in Python with the help of streamlit.

Photo by Glenn Carstens-Peters on Unsplash

Example: Annotate a dataset for classification

For our example use case, we want to label images for a classification task. You can use any dataset of your choice, for example the Fashion-MNIST set. For this tutorial, I use some of my own data. Some time ago, I wrote the following article about dice randomness and recorded ~3,000 images of random throws (*.png images, 640 × 480 px).

Below is a preview of the dataset showing the first three images:

Example *.png images of the random dice dataset. Each image is 640 × 480 px.

I uploaded a subset of these images (the first 100, ~10 MB) here so that you can follow along. Feel free to reach out in case you want the full dataset. In the article, I used OpenCV to extract the number of dots from the images automatically. Back then, a commenter found a mislabeled image, showing that the automatic annotation was not perfect. Notably, even the famous MNIST dataset contains 15 label errors.

For the task at hand, we would like to annotate manually and assign each image a number in the range of one to six, or mark it as undefined. Note that in image processing, annotation refers to several different tasks, e.g., drawing regions of interest as well as assigning labels. Here we use annotation in the sense that we have one image and want a single label for it, i.e., the number of dots on the die.

Approach

To build our pipeline, we will use streamlit, an open-source Python library that lets you build front-ends for data apps quickly. A key feature is that the running app updates as you change the code, so you can see directly how the UI changes while coding.

Setup

To get started, you need a working Python environment (e.g., through miniconda).

  • Install streamlit with pip install streamlit
  • Create an empty Python file named annotation.py
  • Copy the downloaded images into the same folder; the image folder should be named DICE_IMAGES_100
  • For this project, additionally install the Python Imaging Library fork: pip install Pillow
  • Run the file with streamlit run annotation.py
  • Access the running web server; the default address is http://localhost:8501/

Now let’s add some functionality to this.

Code and UI

Streamlit's basic event loop runs the script from top to bottom, and it reruns on every user interaction. In earlier versions, this meant that selections were lost across reruns; storing annotations and preserving state required a workaround. Luckily, streamlit recently introduced a feature for exactly this: the session state. The release notes included demo code, which can serve as a starting point for annotation projects; a minimal example follows after the list below. For our task, we want to do the following:

  • Store annotations as a dict in the session_state.
  • Show images from the folder and give options for selection: [1,2,3,4,5,6,NA].
  • Provide a button to download the annotations as *.csv.
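To illustrate the session-state mechanism in isolation, here is a minimal sketch in the spirit of the release-notes demo (the counter variable is just an illustration):

```python
import streamlit as st

# State is initialized once and survives reruns of the script
# within the same browser session.
if "count" not in st.session_state:
    st.session_state.count = 0

# Clicking the button triggers a rerun; the counter is not reset.
if st.button("Increment"):
    st.session_state.count += 1

st.write("Button pressed", st.session_state.count, "times")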

Overall, we can achieve this in less than 50 lines of code:

Code for the streamlit annotation tool.
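A minimal sketch of such a tool could look as follows. It assumes the images live in the DICE_IMAGES_100 folder next to the script (as in the setup above); the label set, file handling, and helper function are illustrative rather than the exact original gist:

```python
import csv
import io
from pathlib import Path

import streamlit as st
from PIL import Image

IMAGE_DIR = Path("DICE_IMAGES_100")  # assumed folder name from the setup step
LABELS = ["1", "2", "3", "4", "5", "6", "NA"]

# Initialize the session state once; it persists across script reruns.
if "annotations" not in st.session_state:
    st.session_state.annotations = {}
    st.session_state.index = 0
    files = sorted(IMAGE_DIR.glob("*.png")) if IMAGE_DIR.is_dir() else []
    st.session_state.files = [p.name for p in files]

if not st.session_state.files:
    st.error(f"No *.png images found in {IMAGE_DIR}/")
    st.stop()

def annotate(label):
    """Store the label for the current image and advance to the next one."""
    current = st.session_state.files[st.session_state.index]
    st.session_state.annotations[current] = label
    if st.session_state.index + 1 < len(st.session_state.files):
        st.session_state.index += 1

# Sidebar: progress counter and CSV export of all annotations so far.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["file", "label"])
writer.writerows(sorted(st.session_state.annotations.items()))
st.sidebar.write(f"{len(st.session_state.annotations)} of "
                 f"{len(st.session_state.files)} images annotated")
st.sidebar.download_button("Download annotations as *.csv",
                           buffer.getvalue(), file_name="annotations.csv")

# Main area: the current image plus one button per label.
current = st.session_state.files[st.session_state.index]
st.image(Image.open(IMAGE_DIR / current), caption=current)
for column, label in zip(st.columns(len(LABELS)), LABELS):
    column.button(label, on_click=annotate, args=(label,))
```

Using on_click callbacks means the label is stored before the script reruns, so a single click records the annotation and immediately shows the next image.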

Copy this code into the annotation.py file from above. Streamlit should detect the code change, and within the browser window you should see a notification that the source file has changed. Click rerun to refresh the page.

Running instance

With the code from above running, we can click through our dataset. It should look like this:

Screencast of the streamlit page. The GIF plays in real time; we can quickly annotate our dataset.

The GIF plays in real time, and I hope one can appreciate that assigning a label and loading the next image happens very quickly, allowing a seamless user experience. Ultimately, we can annotate at a high pace, which highlights the utility of the approach. At any time during the annotation process, we can export the current state of the annotations as *.csv.

Limitations

Arguably, the solution is handy for prototyping but has several drawbacks, which we should briefly discuss:

Concurrency: The session state works on a per-session basis. This means that if another user connects at the same time, they will annotate the same dataset independently. Depending on the use case, this can be useful, e.g., when you want to collect multiple annotations from different users for the same files. On the other hand, when you have a lot of files and want to distribute the work so that each image is annotated only once, this will not work. Here, one could add functionality so that each user gets their own batch of files from the entire set, as sketched below.
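As a sketch of the batch idea, one could partition the files deterministically, e.g., by hashing file names, so that each annotator (identified by a hypothetical numeric ID) sees a disjoint subset:

```python
import hashlib

def files_for_annotator(files, annotator_id, n_annotators):
    """Deterministically assign each file to exactly one annotator."""
    def bucket(name):
        # Hash the file name and map it to one of n_annotators buckets.
        return int(hashlib.md5(name.encode()).hexdigest(), 16) % n_annotators
    return [f for f in files if bucket(f) == annotator_id]

# Example with hypothetical file names: annotator 0 of 4
# receives roughly a quarter of the files.
files = [f"dice_{i:03d}.png" for i in range(100)]
print(files_for_annotator(files, annotator_id=0, n_annotators=4))
```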

File integration: In addition, the current implementation is not scalable from a file-management perspective. We load the image data from the same folder the script lies in. This is impractical for larger image sets, where we would rather integrate a more extensive file system, e.g., by connecting to cloud storage such as an S3 bucket.
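Such an integration could look roughly like the following boto3 sketch, where the bucket name and prefix are hypothetical:

```python
import io

import boto3
from PIL import Image

s3 = boto3.client("s3")
BUCKET = "my-annotation-data"  # hypothetical bucket name

def list_images(prefix="dice/"):
    """List the *.png keys under the given prefix.

    Note: list_objects_v2 returns at most 1,000 keys per call;
    use a paginator for larger buckets.
    """
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])
            if obj["Key"].endswith(".png")]

def load_image(key):
    """Fetch one image from S3 and open it with Pillow."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return Image.open(io.BytesIO(body))
```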

Bookkeeping of annotations: Exporting annotations as *.csv is rather rudimentary and becomes inconvenient with many datasets or annotation iterations. Here, a connection to a database, e.g., MongoDB, would help keep track of annotated files. Furthermore, one should extend the data that is stored: recording additional metadata such as time, user, and file hash can be valuable in this context.
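As a sketch with pymongo, storing one annotation record with such metadata could look like this (the connection string, database, and collection names are hypothetical):

```python
import hashlib
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
collection = client["annotation_db"]["dice_labels"]

def save_annotation(path, label, user):
    """Store one annotation together with useful metadata."""
    with open(path, "rb") as f:
        file_hash = hashlib.sha256(f.read()).hexdigest()
    collection.insert_one({
        "file": str(path),
        "label": label,
        "user": user,
        "file_hash": file_hash,  # detects renamed or duplicate files
        "timestamp": datetime.now(timezone.utc),
    })
```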

Conclusion

Streamlit is an extremely fast way to build utilities that can help you in your data science workflow. Here, we used it to build a labeling pipeline. However, some limitations can hinder the approach's scalability and demand additional integrations. For larger tasks, it might be worthwhile to check whether other existing tools are better suited for the job. These days, annotation tasks can also be outsourced as microwork, which could be a worthwhile alternative. Nevertheless, for the annotation task at hand, the streamlit approach is a valuable solution.
