Enhance Your ML Experimentation Workflow with Real-Time Plots



Image generated using Stable Diffusion


Part 2 of the tutorial on how to run and evaluate experiments without leaving your IDE

In the previous article of this series, I demonstrated how to use DVC’s VS Code extension to transform our IDE into an experimentation platform, allowing us to directly run and evaluate ML experiments. I also mentioned that the extension offers useful plotting functionalities, which enable us to visualize and evaluate the performance of our experiments using interactive plots. To make it even better, the extension also offers live plotting of certain metrics during the training phase. You can get a sneak peek of this feature in the following figure.

Source, GIF used with permission by iterative

This article will demonstrate how to enhance the previously-introduced experimentation workflow by monitoring model performance and evaluating experiments with interactive plots, all within VS Code. To achieve this, we’ll tackle a binary image classification problem. First, we will provide an overview of transfer learning in computer vision and share some details about the selected dataset.

Problem definition and methodology

Image classification is one of the most popular tasks in the field of computer vision. For our example, we will use the cat vs dog classification problem, which has been widely used in the research community to benchmark different deep learning models. As you might have guessed, the goal of the project is to classify an input image as either a cat or a dog.

To achieve high accuracy even with limited training data, and to speed up the training process, we will leverage transfer learning. Transfer learning is a powerful deep learning technique that has gained significant popularity, especially in computer vision. It allows us to take the knowledge a model has acquired on one domain or problem, typically by training on a large and general dataset, and apply it to a different one.

One of the approaches to using transfer learning for computer vision is based on the idea of feature extraction. First, a model is trained on a large and general dataset (for example, the ImageNet dataset). This model serves as a generic model of “vision”. Then, we can use the learned feature maps of such a model without having to start the training of a custom network from scratch.

For our use case, we will utilize a pre-trained model (ResNet50) to extract relevant features for our binary classification problem. The approach consists of a few steps:

  1. Obtain a pre-trained model, i.e., a saved network that was previously trained on a large dataset. You can find some examples here.
  2. Use the feature maps learned by the selected network to extract meaningful features from images that the network was not trained on.
  3. Add a new classifier on top of the pre-trained network. The classifier will be trained from scratch since the classification component of the pre-trained model is specific to its original task.

We will show how to do all of this in the following sections. However, please bear in mind that this is not a tutorial on transfer learning. If you would like to learn more about the theory and implementation, please refer to this article or this tutorial.

Getting the data

Using the following snippet, we can download the cats vs. dogs dataset. The original dataset contained 12,500 images of each class. However, for our project, we will be using a smaller, filtered dataset that contains 1,000 training images and 500 validation images per class. An additional benefit of downloading the filtered dataset via TensorFlow is that it does not contain some corrupted images that were present in the original dataset (please see here for more information).

import os
import tensorflow as tf
import shutil

DATA_URL = "https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip"
DATA_PATH = "data/raw"

path_to_zip = tf.keras.utils.get_file(
    "cats_and_dogs.zip", origin=DATA_URL, extract=True
)
download_path = os.path.join(os.path.dirname(path_to_zip), "cats_and_dogs_filtered")

train_dir_from = os.path.join(download_path, "train")
validation_dir_from = os.path.join(download_path, "validation")

train_dir_to = os.path.join(DATA_PATH, "train")
validation_dir_to = os.path.join(DATA_PATH, "validation")

shutil.move(train_dir_from, train_dir_to)
shutil.move(validation_dir_from, validation_dir_to)

The following tree presents the structure of the directories containing the downloaded images:

📂data
 ┗ 📂raw
 ┃ ┣ 📂train
 ┃ ┃ ┣ 📂cats
 ┃ ┃ ┗ 📂dogs
 ┃ ┗ 📂validation
 ┃ ┃ ┣ 📂cats
 ┃ ┃ ┗ 📂dogs
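
As a quick sanity check after the move, we can count the files in each class folder. The snippet below is only a sketch: the helper name count_images and the list of extensions are my own assumptions and not part of the original project. With the filtered dataset, we would expect roughly 1,000 images per class in train and 500 per class in validation.

```python
from pathlib import Path

# Hypothetical helper, not part of the original project.
# The set of extensions is an assumption; extend it if needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images(root):
    """Return {split: {class_name: n_images}} for a root/split/class layout."""
    counts = {}
    for split_dir in sorted(Path(root).iterdir()):
        if not split_dir.is_dir():
            continue
        counts[split_dir.name] = {
            class_dir.name: sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts


if __name__ == "__main__":
    data_root = Path("data/raw")
    # Only report when the data has actually been downloaded.
    if data_root.exists():
        for split, per_class in count_images(data_root).items():
            print(split, per_class)
```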

In case you would like to use the complete dataset for your experiments, you can load it using tensorflow_datasets.

Experimenting with Neural Networks

In this section, we will show the code used for training and experimenting with our neural network classifier. Specifically, we will need the following three files:

  • train.py — contains the code used for training the neural network.
  • params.yaml — contains the parameters used for training the neural network, such as the size of the input images, batch size, learning rate, number of epochs, etc.
  • dvc.yaml — contains the DVC pipeline, which stores information about all the steps that are executed within our project, including their respective dependencies and outputs. For a more thorough description of this file and its structure, please refer to my previous article.

As a matter of fact, our current setup is more advanced than the bare minimum. While we could have started with just the training script, we chose to implement a more sophisticated setup right from the start. This will allow us to conveniently run experiments in a queue and easily parameterize them, among other benefits.

Let’s start with the dvc.yaml file as it contains this project’s pipeline. As this is a relatively simple project, it only has one stage called train. In the file, we can see which script contains the stage’s code, what its dependencies are, where the parameters are located, and what the outputs are. The outs section contains a directory that does not exist yet (dvclive), which will be automatically created while running our experiments.

stages:
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/raw
    params:
      - train
    outs:
      - models
      - metrics.csv
      - dvclive/metrics.json:
          cache: False
      - dvclive/plots

Let’s proceed to the params.yaml file. We have already mentioned what it contains, so its contents should not come as a surprise:

train:
  image_width: 180
  image_height: 180
  batch_size: 32
  learning_rate: 0.01
  n_epochs: 15

Naturally, the file can contain many more parameters for multiple stages of the project, which are defined in the DVC pipeline.

Finally, we proceed to the file used for training the neural network. To make it more readable, we will break it down into three code snippets. In the first one, we execute the following steps:

  • Import the necessary libraries.
  • Define the data directories separately for the training and validation datasets.
  • Load the parameters from the params.yaml file.
  • Define the training and validation datasets using the image_dataset_from_directory functionality of Keras.

import os
from pathlib import Path

import numpy as np
import tensorflow as tf
from dvc.api import params_show
from dvclive.keras import DVCLiveCallback

# data directories
BASE_DIR = Path(__file__).parent.parent
DATA_DIR = "data/raw"
train_dir = os.path.join(DATA_DIR, "train")
validation_dir = os.path.join(DATA_DIR, "validation")

# get the params
params = params_show()["train"]
IMG_WIDTH, IMG_HEIGHT = params["image_width"], params["image_height"]
IMG_SIZE = (IMG_HEIGHT, IMG_WIDTH)  # Keras expects (height, width)
BATCH_SIZE = params["batch_size"]
LR = params["learning_rate"]
N_EPOCHS = params["n_epochs"]

# get image datasets
train_dataset = tf.keras.utils.image_dataset_from_directory(
    train_dir, shuffle=True, batch_size=BATCH_SIZE, image_size=IMG_SIZE
)

validation_dataset = tf.keras.utils.image_dataset_from_directory(
    validation_dir, shuffle=True, batch_size=BATCH_SIZE, image_size=IMG_SIZE
)

The second part of the training script contains the definition of the neural network architecture that we want to use for this project.

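def get_model():
    """Prepare the ResNet50 model for transfer learning."""

    # simple augmentations; the rotation factor of 0.2 is an assumed value
    data_augmentation = tf.keras.Sequential(
        [
            tf.keras.layers.RandomFlip("horizontal"),
            tf.keras.layers.RandomRotation(0.2),
        ]
    )

    preprocess_input = tf.keras.applications.resnet50.preprocess_input

    # frozen ResNet50 backbone used as a feature extractor
    IMG_SHAPE = (IMG_HEIGHT, IMG_WIDTH, 3)
    base_model = tf.keras.applications.ResNet50(
        input_shape=IMG_SHAPE, include_top=False, weights="imagenet"
    )
    base_model.trainable = False

    global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
    prediction_layer = tf.keras.layers.Dense(1)

    inputs = tf.keras.Input(shape=IMG_SHAPE)
    x = data_augmentation(inputs)
    x = preprocess_input(x)
    x = base_model(x, training=False)
    x = global_average_layer(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = prediction_layer(x)
    model = tf.keras.Model(inputs, outputs)

    # the optimizer choice (Adam) is an assumption; the head outputs raw
    # logits, so the loss and the accuracy threshold are set accordingly
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LR),
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0, name="accuracy")],
    )

    return model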

We will not dive deeply into the code used for transfer learning, as it is slightly outside the scope of this article. However, it is worth mentioning that:

  • We used some very simple image augmentation techniques: random horizontal flip and random rotation. These augmentations are only applied to the training set.
  • While training the model, we want to track its accuracy. We chose this metric because we are dealing with a balanced dataset, but we could easily track additional metrics such as precision and recall.
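
Since the final Dense(1) layer in the snippet above has no activation, the model outputs raw logits rather than probabilities. The following framework-free sketch illustrates how accuracy could be computed from such logits. The helper names are hypothetical and the numbers are made up; the point is only that thresholding a logit at 0 is equivalent to thresholding the sigmoid probability at 0.5.

```python
import math


def sigmoid(z):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def accuracy_from_logits(logits, labels):
    """Share of samples whose thresholded logit matches the 0/1 label."""
    predictions = [1 if z > 0 else 0 for z in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits and ground-truth labels for four images.
logits = [2.3, -1.1, 0.4, -0.2]
labels = [1, 0, 0, 0]

# Thresholding the logit at 0 matches thresholding the probability at 0.5.
assert all((z > 0) == (sigmoid(z) > 0.5) for z in logits)

print(accuracy_from_logits(logits, labels))  # 3 of 4 predictions are correct
```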

The third and last snippet contains the main body of our script:

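def main():
    model_path = BASE_DIR / "models"
    model_path.mkdir(parents=True, exist_ok=True)

    model = get_model()

    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(
            model_path / "model.keras", monitor="val_accuracy", save_best_only=True
        ),
        tf.keras.callbacks.CSVLogger("metrics.csv"),
        DVCLiveCallback(save_dvc_exp=True),
    ]

    history = model.fit(
        train_dataset,
        epochs=N_EPOCHS,
        validation_data=validation_dataset,
        callbacks=callbacks,
    )

if __name__ == "__main__":
    main()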

In this snippet, we do the following:

  • We create the models directory if it does not exist.
  • We get the model using the get_model function defined in the previous snippet.
  • We define the callbacks we want to use. The first two are standard callbacks used while training neural networks. The first one is used for creating checkpoints while training. The second one stores the selected metrics (in our case, accuracy and loss) after each epoch into a CSV file. We will cover the third callback in a moment.
  • We fit the model to the training data and evaluate using the validation set.
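
The checkpointing behaviour configured above (monitor="val_accuracy" together with save_best_only=True) boils down to a simple rule: keep a checkpoint only when the monitored metric improves. Here is a toy sketch of that rule in plain Python. It is not the Keras implementation, the class name is hypothetical, and the validation accuracies are made up.

```python
class BestOnlyCheckpoint:
    """Toy stand-in for a 'save best only' checkpoint (not the Keras class)."""

    def __init__(self, monitor="val_accuracy"):
        self.monitor = monitor
        self.best = float("-inf")
        self.saved_epochs = []

    def on_epoch_end(self, epoch, logs):
        """Record a 'save' only when the monitored metric improves."""
        current = logs[self.monitor]
        if current > self.best:
            self.best = current
            # A real callback would write the model to disk at this point.
            self.saved_epochs.append(epoch)


# Made-up validation accuracies for five epochs.
val_accuracies = [0.81, 0.86, 0.85, 0.90, 0.89]

checkpoint = BestOnlyCheckpoint()
for epoch, val_acc in enumerate(val_accuracies):
    checkpoint.on_epoch_end(epoch, {"val_accuracy": val_acc})

print(checkpoint.saved_epochs, checkpoint.best)
```

Only the epochs that improve on the best value seen so far trigger a save, so the file on disk always corresponds to the best validation accuracy.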

The third callback we used, DVCLiveCallback, comes from a companion library called DVCLive. In general, it is a library that provides utilities for logging ML parameters, metrics, and other metadata in simple file formats. You can think of it as an ML logger similar to, for example, MLflow. The biggest difference is that by using DVCLive, we do not have to use any additional services or servers. All of the logged metrics and metadata are stored as plain text files, which can be versioned with Git.

In this particular case, we used a Keras-compatible callback provided by DVCLive. DVCLive provides similar utilities for the most popular machine and deep learning libraries, such as TensorFlow, PyTorch, LightGBM, XGBoost, and more. You can find the complete list of supported libraries here. It is also worth mentioning that even though DVCLive provides many useful callbacks that we can use out-of-the-box, it does not mean this is the only way to log the metrics. We can manually log whichever metrics/plots we want at any point we want.

When we specified the DVCLiveCallback, we set the save_dvc_exp argument to True. By doing so, we indicated that we would like to automatically track the results using Git.

At this point, we are ready to run our first experiment. For that, we will use the parameters we have initially specified in the params.yaml file. To run the experiment, we can either press the Run Experiment button in the Experiments tab of the DVC panel or use the following command in the terminal:

dvc exp run

For more information on running the experiments and navigating the Experiments tab, please refer to my previous article.

After running the experiment, we notice that a new directory, dvclive, was created. The DVCLive callback we used in our code automatically logged data and stored it in plain text files in that directory. In our case, the directory looks like this:

📂dvclive
 ┣ 📂plots
 ┃ ┗ 📂metrics
 ┃ ┃ ┣ 📂eval
 ┃ ┃ ┃ ┣ 📜accuracy.tsv
 ┃ ┃ ┃ ┗ 📜loss.tsv
 ┃ ┃ ┗ 📂train
 ┃ ┃ ┃ ┣ 📜accuracy.tsv
 ┃ ┃ ┃ ┗ 📜loss.tsv
 ┣ 📜.gitignore
 ┣ 📜dvc.yaml
 ┣ 📜metrics.json
 ┗ 📜report.html

We provide a brief description of the generated files:

  • The TSV files contain the accuracy and loss over epochs, separately for the training and validation datasets.
  • metrics.json contains the requested metrics for the final epoch.
  • report.html contains plots of the tracked metrics in the form of an HTML report.
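
Because these are plain text files, they are also easy to inspect outside of VS Code. The sketch below reads one of the TSV files with the standard library and reports the step with the highest value. It assumes the file has a header row with a step column and a column named after the metric (here accuracy); the helper name best_step is hypothetical, so please adjust the column names if your DVCLive version writes a different header.

```python
import csv
from pathlib import Path


def best_step(tsv_path, metric, step_column="step"):
    """Return (step, value) of the row with the highest metric value."""
    with open(tsv_path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    best = max(rows, key=lambda row: float(row[metric]))
    return int(best[step_column]), float(best[metric])


if __name__ == "__main__":
    # Path taken from the directory tree above.
    path = Path("dvclive/plots/metrics/eval/accuracy.tsv")
    if path.exists():
        print(best_step(path, metric="accuracy"))
```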

At this point, we can inspect the tracked metrics in the HTML report. However, we can also do that directly from VS Code by navigating to the Plots tab in the DVC extension.

Using the left-hand sidebar, we can select the experiments we want to visualize. I have chosen the main one, but you can see that I have already run a few experiments before. In the Plots menu, we can select which metrics we want to plot. This functionality is very handy when we track a lot of metrics, but we only want to inspect a few of them at a time.

In the main view, we can see the visualized metrics. The upper plots present the metrics calculated using the validation set, while the lower ones are based on the training set. What the static image cannot show is that these plots are live: the metrics are updated after each epoch of training is completed. We can use this tab to monitor the progress of our training jobs in real time.

For the second experiment, we increase the learning rate from 0.01 to 0.1. We can run such an experiment using the following command:

dvc exp run -S train.learning_rate=0.1
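
The -S flag addresses a nested entry of params.yaml through a dotted path (here, learning_rate inside the train section). The snippet below only illustrates that idea on a plain dictionary. It is not how DVC implements the override, and the helper name set_by_dotted_path is hypothetical.

```python
import copy


def set_by_dotted_path(params, dotted_key, value):
    """Return a copy of params with the nested key given by 'a.b.c' set to value."""
    updated = copy.deepcopy(params)
    *parents, leaf = dotted_key.split(".")
    node = updated
    for key in parents:
        # Walk down the nesting, creating intermediate sections if missing.
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


params = {"train": {"learning_rate": 0.01, "n_epochs": 15}}
overridden = set_by_dotted_path(params, "train.learning_rate", 0.1)

print(overridden["train"]["learning_rate"])  # 0.1
print(params["train"]["learning_rate"])      # 0.01, the original is untouched
```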

To monitor the model during training, we also selected the workspace experiment in the Experiments menu. In the image below, you can see what the plots look like while the neural network is still in the training stage (you can see that the process is running in the terminal window).

So far, all of our plots were generated in the Data Series section of the Plots tab. In total, there are three sections, each with different kinds of plots:

  • Data Series — contains visualizations of metrics stored in text files (JSON, YAML, CSV, or TSV).
  • Images — contains side-by-side visualizations of stored images, such as JPG files.
  • Trends — contains automatically generated and updated scalar metrics per epoch if DVC checkpoints are enabled.

We have already explored how to track and visualize metrics using DVCLive’s callbacks. Using DVC also allows us to track plots stored as images. For instance, we could create a bar chart representing the feature importance obtained from a certain model. Or, to simplify, we could track a confusion matrix.

The general approach to track and visualize custom plots using DVC is to create the plot manually, save it as an image, and then track it. This allows us to track any custom plot we create. Alternatively, for certain scikit-learn plots, we can use DVCLive’s log_sklearn_plot method and generate the plot using data (predictions vs. ground truth) stored in JSON files. This approach currently works for the following kinds of plots: probability calibration, confusion matrix, ROC curve, and precision-recall curve.
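
To make the confusion matrix concrete, here is a framework-free sketch of the counts such a plot is built from, given ground-truth labels and already thresholded predictions. The function name and the toy labels are assumptions made purely for illustration; this is neither DVCLive's nor scikit-learn's code.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) label pairs for a binary 0/1 problem."""
    pairs = Counter(zip(y_true, y_pred))
    return {
        "true_negative": pairs[(0, 0)],
        "false_positive": pairs[(0, 1)],
        "false_negative": pairs[(1, 0)],
        "true_positive": pairs[(1, 1)],
    }


# Made-up labels and predictions for six images.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

print(confusion_counts(y_true, y_pred))
```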

For this example, we will demonstrate how to start tracking a confusion matrix. In the code snippet below, you can see the modified train.py script. We have removed many things that did not change, making it easier to follow the modifications.

import os
from pathlib import Path
import numpy as np
import tensorflow as tf
from dvc.api import params_show
from dvclive.keras import DVCLiveCallback
from dvclive import Live

# data directories, parameters, datasets, and the model function did not change

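def main():
    model_path = BASE_DIR / "models"
    model_path.mkdir(parents=True, exist_ok=True)

    model = get_model()

    with Live(save_dvc_exp=True) as live:

        callbacks = [
            tf.keras.callbacks.ModelCheckpoint(
                model_path / "model.keras", monitor="val_accuracy", save_best_only=True
            ),
            tf.keras.callbacks.CSVLogger("metrics.csv"),
            DVCLiveCallback(live=live),
        ]

        history = model.fit(
            train_dataset,
            epochs=N_EPOCHS,
            validation_data=validation_dataset,
            callbacks=callbacks,
        )

        # restore the best checkpoint and collect predictions on the validation set
        model.load_weights(str(model_path / "model.keras"))
        y_pred = np.array([])
        y_true = np.array([])
        for x, y in validation_dataset:
            y_pred = np.concatenate([y_pred, model.predict(x).flatten()])
            y_true = np.concatenate([y_true, y.numpy()])

        # the model outputs logits, so 0 is the decision threshold
        y_pred = np.where(y_pred > 0, 1, 0)

        live.log_sklearn_plot("confusion_matrix", y_true, y_pred)

if __name__ == "__main__":
    main()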

As you can see, this time we created an instance of a Live object, which we use both for the callback and for the log_sklearn_plot method. To track all the metrics within a single experiment, we instantiated Live using a context manager (the with statement). Without it, DVCLive would create the experiment as soon as Keras calls on_train_end, and any data logged after that point (in our case, the confusion matrix plot) would not be tracked within the experiment.

After modifying the training script, we reran the two experiments with different learning rates (0.1 vs. 0.01). As a result, we can now see the confusion matrices in the Plots tab, right under the previously explored plots.

The last thing to mention is that running the modified training script also modifies the dvc.yaml pipeline within the dvclive directory. As you can see below, it now contains information about the tracked confusion matrix, such as how to build it, which template to use, and what labels to use.

metrics:
- metrics.json
plots:
- plots/metrics
- plots/sklearn/confusion_matrix.json:
    template: confusion
    x: actual
    y: predicted
    title: Confusion Matrix
    x_label: True Label
    y_label: Predicted Label

Wrapping up

In the previous article of the series, we showed how to start using DVC and the dedicated VS Code extension to turn your IDE into an ML experimentation platform. In this part, we continued where we left off and we explored various (live-) plotting capabilities of the extension. Using those, we can easily evaluate and compare experiments to choose the best one.

In my opinion, there are two significant advantages of using a DVC-enhanced workflow. First, we do not need any external services or setups to start our experiments. The only requirement is a Git repo. Furthermore, DVC works with Git in a clean way. Although every experiment is saved in a Git commit, those commits are hidden and do not clutter our repository. In fact, we do not even need to create separate branches.

Second, everything happens within our IDE, enabling us to focus on our project without constantly switching between the IDE, browser, and other tools. This way, we can avoid distractions and the ever-threatening context-switching.

As always, any constructive feedback is more than welcome. You can reach out to me on Twitter or in the comments. You can find all the code used for this article in this repository.

