A Step-by-Step Guide to Tune a Model on Google Cloud’s Vertex AI


Tuning model hyperparameters, and visualizing metrics on managed Tensorboard

Vertex AI (Source: Google Cloud)


In the previous article (first one of this series), we walked through the step-by-step instructions to have the first model trained on Vertex AI, Google Cloud’s newest integrated machine learning platform. The problem we were solving was an image classification task for the CIFAR10 dataset, which contains 60,000 32 x 32 images of ten classes.

In this article, we’ll build on that to improve the model performance and explore two very cool tools on Vertex AI: Hypertune and Experiments.

Optimization Idea

An observation from the previous article was that training for more epochs yielded better results. When training for five epochs locally, we got 60% evaluation accuracy. With 15 epochs on Vertex AI, we obtained 66% evaluation accuracy. Note that it’s usually better to use precision and recall as the performance metrics. But we are dealing with a perfectly balanced dataset. So we will stick to accuracy for simplicity.

The simplest and most brute-force idea is to train for more epochs. We set the epoch argument to 50 and launched a training job (refer to the previous article for how to do that). Unfortunately, the result was disappointing. The evaluation accuracy was 64%, which was lower than the result from the 15 epochs training. At the same time, the training accuracy shot up to 93%. Clearly, the model was overfitting the training data. So we’ll need some kind of regularization to generalize the model.

There are many regularization techniques available. The simplest is probably Dropout, which randomly shuts off some neurons for each training batch, in an attempt to force the model to learn more robust features. We’ll use Dropout to improve our model.


Now that we’ve decided to use Dropout, the immediate follow-up question is the dropout rate: the fraction of neurons the model turns off for each training batch. The dropout rate is a hyperparameter. Many other configurations are hyperparameters too, such as the learning rate, the number of neurons in a particular layer, the number of layers, activation functions, etc. For demo purposes, let’s focus on the dropout rate. And for simplicity, we only add one Dropout layer after the Flatten layer. See the following code snippet for implementation details. The model is the same as the one we used in the previous article; the only added element is the Dropout layer.

Hypertune code

As you can see, the dropout rate is designed to be an argument that we can pass to the training job. Our goal here is to find the optimal dropout rate that yields the highest validation accuracy. Obviously, you can manually vary the dropout rate by launching multiple training jobs with the dropout rates of your best guess. But why do something repetitive and boring when you can rely on automation?

Hypertune comes to the rescue. It does exactly that: it launches multiple training jobs with various hyperparameter values (the dropout rate in our case) drawn from the ranges we specify. Hypertune monitors the model performance metrics, which we’ll need to expose, and searches the allowed space to maximize the metrics. Hypertune supports the typical search algorithms, and the default is Bayesian hyperparameter optimization.

Coming back to the code, we’ve already parameterized the dropout rate. Now we need to update the model to expose metrics. The metric we use is validation accuracy, calculated on the validation dataset at the end of each epoch. We’ll implement the metrics reporting using a custom Tensorflow callback, which calls the Hypertune Python library to report the metrics. The Hypertune library essentially just dumps the metrics in some structured format to a temporary folder on the host machine, which will be picked up by the Hypertune service.

Hypertune metrics code
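Since the gist isn’t shown here, the following is a sketch of such a callback using the cloudml-hypertune library’s reporting API. The metric tag name (`accuracy`) is an assumption; whatever tag you report must match the `metricId` in the tuning config.

```python
import tensorflow as tf


class HypertuneCallback(tf.keras.callbacks.Callback):
    """Reports validation accuracy to the Hypertune service after each epoch."""

    def on_epoch_end(self, epoch, logs=None):
        # Imported lazily so the trainer still runs locally without the
        # cloudml-hypertune package installed.
        import hypertune
        hpt = hypertune.HyperTune()
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='accuracy',  # must match metricId in hpt.yaml
            metric_value=logs['val_accuracy'],
            global_step=epoch,
        )
```

You would then pass `HypertuneCallback()` in the `callbacks` list of `model.fit(...)` alongside a validation dataset, so `val_accuracy` is available in `logs` at the end of every epoch.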

Note that we purposely skipped the checkpointing part so that we don’t need to arrange different Google Cloud Storage folders for different tuning trials. If we’re unlucky enough to hit a host preemption, so be it; that particular training trial will just take a bit longer because it restarts from scratch.

Now we’re ready to launch a hyperparameter tuning job. Similar to the previous article, we’ll use the gcloud command line tool.

gcloud beta ai hp-tuning-jobs create \
--display-name=e2e-tutorial-hpt --region=us-central1 \
--config=hpt.yaml --max-trial-count=10

The command arguments are all very self-explanatory. The meat is in the config file:

Hypertune config
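The config gist isn’t reproduced here; below is a sketch of what an hpt.yaml could look like, assuming the metric tag `accuracy` reported by the callback and an illustrative search range for `dropout_rate`. The machine type, container image, and package path are placeholders, not the values from the original article.

```yaml
studySpec:
  metrics:
  - metricId: accuracy        # must match the tag reported via Hypertune
    goal: MAXIMIZE
  parameters:
  - parameterId: dropout_rate  # maps to the --dropout_rate training argument
    doubleValueSpec:
      minValue: 0.1
      maxValue: 0.6
    scaleType: UNIT_LINEAR_SCALE
trialJobSpec:
  workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-4
    replicaCount: 1
    pythonPackageSpec:
      executorImageUri: us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-4:latest
      packageUris:
      - gs://YOUR_BUCKET/trainer-0.1.tar.gz
      pythonModule: trainer.task
```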

The workerPoolSpecs section under trialJobSpec is identical to the config we used to launch training jobs in the previous article. The newly added field here is studySpec, which contains the target metric and the search space of the hyperparameter in question.

The hp-tuning-jobs create command will return a job ID, which we can use to query the hyperparameter tuning status.

gcloud beta ai hp-tuning-jobs describe JOB_ID --region=us-central1

We can also visit the tuning job in the UI. Go to Vertex AI -> Training -> Hyperparameter Tuning, and click on the tuning job to inspect the details.

Hypertune results

It looks like it has finished, and a dropout rate of 0.45 yielded the highest validation accuracy.

Experiment (Managed Tensorboard)

We could just take the model of the highest validation accuracy from the hyperparameter tuning job and call it a day. But we want to try out the Experiments feature. So we’ll train the model one last time with the optimal dropout rate and hook up the Experiments tool (managed Tensorboard) to visualize real-time model training.

To use Experiments, the first thing we should do is create a Tensorboard instance.

gcloud beta ai tensorboards create \
--display-name=e2e-tutorial-viz --region=us-central1

We can see a Tensorboard instance is created by going to UI: Vertex AI -> Experiments -> Tensorboard Instances.

Tensorboard instance

Next we’ll add a Tensorboard callback to our model training so that it exports the data needed for Tensorboard visualization.

Tensorboard Metrics Code
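The gist isn’t shown here; a minimal sketch of the callback setup could look like the following, reading the log directory from the environment variable the training service provides (with a local fallback for running outside Vertex AI).

```python
import os
import tensorflow as tf

# AIP_TENSORBOARD_LOG_DIR is set by the Vertex AI training service when the
# job is created with a Tensorboard resource; fall back to a local directory.
log_dir = os.environ.get('AIP_TENSORBOARD_LOG_DIR', 'logs')

# Standard Keras callback that writes scalars and histograms for Tensorboard.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
```

`tensorboard_cb` is then passed in the `callbacks` list of `model.fit(...)` just like the Hypertune callback earlier.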

There is an AIP_TENSORBOARD_LOG_DIR environment variable involved, which we’ll explain later.

One more thing before launching the training job: We need to create a service account and configure its permissions for Tensorboard visualization. You can follow this link to set it up. I know I said in the first article that I’d try to make the articles as self-contained as possible. But this permission setup is very mechanical and the instruction is all in one place, so I don’t feel the need for duplicating them.

Now we can launch the training job. Instead of using the gcloud command this time, we’ll send a curl request directly, which is what the gcloud command does under the hood. That’s because there seems to be an argument-parsing bug in the gcloud command that prevents creating a training job with Tensorboard visualization.

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-X POST -d @request.json \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/us-central1/customJobs

The core content here is in the request.json file:

Tensorboard Request
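The request gist isn’t reproduced here; below is a sketch of what request.json could look like, using the same placeholders the article refers to. The field layout follows the Vertex AI CustomJob REST resource as I understand it; the machine type, container image, and module name are illustrative assumptions.

```json
{
  "displayName": "e2e-tutorial-tensorboard",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {"machineType": "n1-standard-4"},
        "replicaCount": 1,
        "pythonPackageSpec": {
          "executorImageUri": "us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-4:latest",
          "packageUris": ["GCS_PATH_FOR_PYTHON_CODE"],
          "pythonModule": "trainer.task"
        }
      }
    ],
    "baseOutputDirectory": {"outputUriPrefix": "GCS_PATH_FOR_TENSORBOARD_LOG"},
    "serviceAccount": "SERVICE_ACCOUNT_FROM_THE_PERMISSION_STEP",
    "tensorboard": "FULL_RESOURCE_NAME_OF_TENSORBOARD"
  }
}
```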

A couple of things to note: First of all, GCS_PATH_FOR_PYTHON_CODE is where our packaged Python distribution lives. Refer to the first article for how we package and store the training code in Google Cloud Storage. Secondly, GCS_PATH_FOR_TENSORBOARD_LOG is a Google Cloud Storage location for the Tensorboard log. The training host sets the aforementioned AIP_TENSORBOARD_LOG_DIR environment variable to the value of GCS_PATH_FOR_TENSORBOARD_LOG so that it’s accessible in our training code. Thirdly, SERVICE_ACCOUNT_FROM_THE_PERMISSION_STEP is the service account you’ll have created if you followed the permission setup mentioned earlier. Lastly, FULL_RESOURCE_NAME_OF_TENSORBOARD is the full resource name of the Tensorboard instance, which we can obtain with the following command:

gcloud beta ai tensorboards list --region=us-central1

Now we’re all set. Just wait for the training job to start, then navigate to the UI: Vertex AI -> Experiments -> Experiments and click on the Open Tensorboard button, which opens the Tensorboard visualization for our training job on a new tab.

Tensorboard entry
Tensorboard visualization

Wrap Up

Finally, let’s evaluate the model on the test dataset (the code is in the first article). The evaluation accuracy is 74%. From the initial 60%, we’ve come a long way. Obviously, there is still room for improvement. But this journey has served us well in demonstrating the capability of Vertex AI for training custom models. We’ll take a break here and reconvene in the next episode of this series.

