How to Run Machine Learning Hyperparameter Optimization in the Cloud — Part 1
This is the first of a three-part post in which we will explore some of the ways in which hyperparameter optimization can be performed in a cloud machine learning (ML) environment.
Hyperparameter Tuning (HPT): HPT, or hyperparameter optimization (HPO), refers to the art of finding the optimal set of hyperparameters (e.g., learning rate, momentum, dropout, etc.) for your machine learning model. HPT is an essential part of any machine learning project; applying it appropriately can determine whether your project succeeds or fails. A lot has been written on the art of HPT, e.g., see here for a brief survey of some of the methods for performing HPT and here for a survey of some of the existing frameworks that support HPT. A common method for accelerating HPT is to scale up the number of host machines in order to increase the number of experiments that run in parallel.
HPT in the Cloud: In previous posts (e.g. here) we have expanded on the advantages of performing machine learning in the cloud. In particular, we have noted the virtually infinite scaling capacity of cloud based ML. This property facilitates significant acceleration of our ML projects by enabling us to start up as many training instances and as many parallel experiments as we desire (or can afford). The scalability property makes the cloud an ideal playground for HPT of our ML models.
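To put rough numbers on the scaling argument, here is a back-of-the-envelope sketch. It assumes each instance runs one trial at a time, that instances are billed only while a trial is starting up or running, and that all prices and durations are hypothetical. The function name `hpt_time_and_cost` is ours and not part of any cloud API.

```python
import math


def hpt_time_and_cost(num_trials, trial_minutes, num_instances,
                      startup_minutes=0.0, price_per_hour=1.0):
    """Estimate wall-clock hours and total cost of an HPT run.

    Assumptions (hypothetical): one trial per instance at a time, and
    billing only for the startup and run time of each trial.
    """
    rounds = math.ceil(num_trials / num_instances)
    minutes_per_trial = startup_minutes + trial_minutes
    wall_clock_hours = rounds * minutes_per_trial / 60.0
    cost = num_trials * minutes_per_trial / 60.0 * price_per_hour
    return wall_clock_hours, cost


for n in (1, 4, 16):
    hours, cost = hpt_time_and_cost(num_trials=16, trial_minutes=30,
                                    num_instances=n)
    print(f"{n:>2} instances: {hours:.1f} h wall-clock, ${cost:.2f}")
```

Under these assumptions, adding instances shortens the wall-clock time while the total cost stays the same, which is what makes scaling out attractive. A non-zero `startup_minutes` adds to both numbers for every trial.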
In this post we review some of the ways to run HPT in the cloud. The way to do this is not always immediately obvious. This is due to the fact that HPT typically involves performing multiple trials that must be coordinated. Progress reports need to be collected from all the trials, and appropriate actions need to be taken in accordance with the HPT algorithm of choice. Such coordination is pretty trivial when running in a local environment in which you have full control of your training instances and can easily set them up to communicate with one another. But it is less obvious in the cloud, especially when using a managed training service, such as Amazon SageMaker, where the underlying instance setup and configuration is delegated to the service.
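To make the coordination problem concrete, here is a minimal local sketch. It assumes a hypothetical `run_trial` function that stands in for a real training job and returns a report. A coordinator launches trials in parallel, collects each report as it arrives, and picks the best result. The names and the toy objective are illustrative assumptions and do not come from any of the frameworks discussed in this post.

```python
import math
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_trial(lr):
    """Hypothetical stand-in for a training job; returns a report dict.

    The toy score peaks at lr = 1e-3. A real trial would train a model.
    """
    score = 1.0 / (1.0 + abs(math.log10(lr) + 3.0))
    return {"lr": lr, "score": score}


def coordinate(learning_rates, max_parallel=2):
    """Run trials in parallel and collect their reports as they complete."""
    reports = []
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(run_trial, lr) for lr in learning_rates]
        for future in as_completed(futures):
            reports.append(future.result())
    best = max(reports, key=lambda report: report["score"])
    return best, reports


best, reports = coordinate([1e-1, 1e-2, 1e-3, 1e-4])
print(best["lr"], len(reports))
```

On a single machine the executor does all the bookkeeping. In a managed cloud service, each `run_trial` call would be a separate job on a separate instance, and the collection step is the part that needs a dedicated solution.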
Different Methods for Cloud Based HPT
In previous posts (e.g., here) we have noted the wide range of cloud based machine learning solutions that developers have to choose from. Cloud service providers (CSPs) such as GCP, AWS, and Azure offer a variety of ML training options at different levels of abstraction. On one end of the spectrum, developers can request a “bare-metal” GPU cloud instance and handle all elements of the setup, configuration, and application flow on their own. On the other end of the spectrum are highly specialized, fully managed, cloud based training frameworks. Naturally, the plethora of cloud based training options extends to multiple ways of performing cloud based HPT. In this post we will review and demonstrate four options:
- HPT on a cluster of cloud instances — In this scenario we design a solution tailored for HPT experimentation around cloud compute instances. This option typically requires the most setup and maintenance but enables the most customization.
- HPT inside a managed training environment — Here we rely on a managed training service to build a cluster of instances and run the HPT within this cluster.
- Managed HPT services — Some CSPs offer dedicated APIs for HPT experimentation.
- Wrapping managed training experiments with HPT — In this scenario, the HPT algorithm runs locally (or on a cloud notebook instance) and each HPT experiment is an independently spawned cloud based training job.
Points of Comparison
There are many properties that can be used as a basis of comparison between HPT solutions. In this post we have chosen just a few of the attributes that highlight the strengths and weaknesses of the cloud based solutions we will discuss. These include:
- Algorithm Support: HPT algorithm development for deep learning is an active area of research with new algorithms coming out all the time. There are generally two parts to an HPT algorithm: the parameter search algorithm and the scheduling algorithm. The search algorithm determines how to assign values to the set of parameters from a predefined search space. Examples of search algorithms include trivial methods such as grid search and random search, and more sophisticated (e.g., Bayesian) methods that involve some form of exploitation, i.e., learning from previous results. Scheduling algorithms control how and when to run experiments, how many to run in parallel, how to determine which experiments to terminate early, etc. Ideally, we would have complete freedom in choosing our HPT algorithm. In practice, certain solutions may restrict the use of some algorithms, either explicitly (e.g., through their API) or implicitly (e.g., through restrictions on the number of parallel experiments).
- Auto-scalability: Depending on the HPT algorithm you choose, you may find that different numbers of instances are used during different stages of tuning. For such scenarios, it is ideal to have an HPT solution that supports auto-scaling of the compute instances according to the need dictated by the scheduling algorithm. The alternative may entail maintaining (and paying for) idle compute instances.
- Complexity: Different HPT solutions vary in the complexity of their configuration and maintenance. Solutions based on unmanaged cloud service offerings will usually require more effort.
- Resource Flexibility: HPT frameworks typically include options for running multiple experiments on a single compute instance. However, some cloud HPT solutions restrict the number of experiments per compute instance to one.
- Overhead of Experiment Initialization: The startup time of each new experiment will vary based on the chosen HPT solution. If your experiments are relatively short, this overhead will impact the overall duration (and overall cost) of your HPT.
- Spot Instance Usage: Using spot or preemptible instances for training ML models allows us to take advantage of unused cloud compute capacity at significantly discounted rates. Some HPT solutions are more spot-friendly than others and can, thus, have a meaningful impact on reducing cost.
Additional considerations — no less important — include: reproducibility, warm starting tuning, checkpointing, fault tolerance, distributed training (where each experiment runs on multiple GPUs or multiple instances), cost, and more.
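The split between a search algorithm and a scheduling algorithm, and the way the number of active trials can change during tuning, can be sketched in a few lines. The example below pairs random search with a simplified successive-halving scheduler. The `toy_score` function is a hypothetical stand-in for a validation metric, and this is a simplified illustration rather than the implementation used by any particular HPT framework.

```python
import math
import random


def sample_lr(rng, low=1e-5, high=1e-1):
    """Search algorithm: draw a learning rate log-uniformly (random search)."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def toy_score(lr, epochs):
    """Hypothetical stand-in for validation accuracy after `epochs` epochs."""
    closeness = 1.0 / (1.0 + abs(math.log10(lr) + 3.0))  # best near lr = 1e-3
    return closeness * (1.0 - 0.5 ** epochs)


def successive_halving(num_trials=8, min_epochs=1, seed=0):
    """Scheduling algorithm: keep the better half of the trials at each rung."""
    rng = random.Random(seed)
    trials = [sample_lr(rng) for _ in range(num_trials)]
    epochs = min_epochs
    active_per_rung = []
    while True:
        active_per_rung.append(len(trials))
        ranked = sorted(trials, key=lambda lr: toy_score(lr, epochs),
                        reverse=True)
        if len(ranked) == 1:
            return ranked[0], active_per_rung
        trials = ranked[: len(ranked) // 2]  # stop the weaker half early
        epochs *= 2  # survivors get a larger training budget


best_lr, active_per_rung = successive_halving()
print(best_lr, active_per_rung)
```

The number of active trials shrinks at every rung. A solution that cannot release instances as trials are stopped ends up paying for idle capacity, which is the auto-scalability concern raised above.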
For our impatient readers, here is a table summarizing our own personal views on the four options we will cover in this post.
The post will include brief demonstrations of some of the approaches we discuss in order to highlight some of their properties. These demonstrations will include some service and framework choices including Amazon SageMaker, PyTorch (1.12), Ray Tune (2.0), and Syne Tune (0.12). Please do not view these choices as an endorsement over other options. The best option for you will likely depend on a great many factors including the details of your project, cloud service costs, and more. In addition, please keep in mind that the specific APIs and usages of these frameworks may change by the time you read this post.
I would like to thank Isaac Djemal for his contributions to the post.
Toy Example — Image Classification with a Vision Transformer
In this section we will describe the toy model we will use in our HPT demonstrations. The output of the section is a training function that we will call with different initializations of the optimizer learning rate hyperparameter during HPT. Feel free to skip to the next section if you are eager to get straight to the HPT stuff.
The model we will use for the experiments is an image classification model with a vision transformer (ViT) backbone. The code block below demonstrates how to build a vision transformer model using the HuggingFace ViT API of the Python transformers package (version 4.23.1).
def build_model():
    from transformers import ViTConfig, ViTForImageClassification
    model = ViTForImageClassification(ViTConfig(num_labels=3))
    return model
Note that in practice we will actually tune a slightly modified version of the ViT model that uses maximal update parametrization (μP). This method enables tuning on relatively narrow versions of the ViT model and applying the results to much larger versions. See the appendix at the bottom of this post for details.
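def train(config):
    import torch

    # 1. build ViT model
    model = build_model()
    model.to(torch.cuda.current_device())

    # 2. load and configure beans dataset
    from datasets import load_dataset
    ds = load_dataset('beans')
    from transformers import ViTFeatureExtractor
    feature_extractor = ViTFeatureExtractor()

    def transform(example_batch):
        inputs = feature_extractor([x for x in example_batch['image']],
                                   return_tensors='pt')
        inputs['labels'] = example_batch['labels']
        return inputs

    prepared_ds = ds.with_transform(transform)

    def collate_fn(batch):
        return {
            'pixel_values': torch.stack([x['pixel_values'] for x in batch]),
            'labels': torch.tensor([x['labels'] for x in batch])
        }

    # 3. load metric
    from evaluate import load as load_metric
    metric = load_metric("accuracy")

    def compute_metrics(p):
        import numpy as np
        return metric.compute(predictions=np.argmax(p.predictions, axis=1),
                              references=p.label_ids)

    # 4. define optimizer with the configured lr value
    from torch.optim import AdamW
    optim = AdamW(model.parameters(), lr=config['lr'])
    scheduler = None

    # 5. define Trainer
    from transformers import TrainingArguments
    training_args = TrainingArguments(
        output_dir='/tmp/vit-beans',   # assumed path, choose your own
        remove_unused_columns=False,   # assumed: keeps the 'image' column that transform() needs
    )
    from transformers import Trainer
    trainer = Trainer(
        model=model, args=training_args,
        optimizers=(optim, scheduler),
        data_collator=collate_fn, compute_metrics=compute_metrics,
        train_dataset=prepared_ds['train'],
        eval_dataset=prepared_ds['validation'],
        callbacks=None,  # we will use callbacks for HPT reporting
    )

    # 6. train
    trainer.train()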
The optimizer learning rate is passed into the train function via the config dictionary. This is the hyperparameter that we will tune in our HPT demonstrations.
In the next sections of our three-part post, we will demonstrate four different options for performing HPT. In each case, we will use the training function above to run multiple training experiments with different configurations for the learning rate hyperparameter.
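As a small illustration of what "different configurations" can look like, the sketch below builds a list of `config` dictionaries with log-spaced learning rates and passes each one to a training function. The helper `log_spaced_configs` and the stub `fake_train` are hypothetical names introduced here. In a real run, `fake_train` would be replaced by the training function defined above.

```python
import math


def log_spaced_configs(low, high, num):
    """Return `num` config dicts whose 'lr' values are evenly spaced in log10."""
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (num - 1)
    return [{"lr": 10 ** (lo + i * step)} for i in range(num)]


def fake_train(config):
    """Hypothetical stub standing in for the real training function."""
    print(f"training with lr={config['lr']:.0e}")


configs = log_spaced_configs(1e-5, 1e-1, num=5)
for config in configs:
    fake_train(config)
```

Learning rates are usually explored on a logarithmic scale because useful values span several orders of magnitude.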
Appendix: Vision Transformer with Maximal Update Parameterization
The toy model we use in our experiments follows the Hyperparameter Transfer (μTransfer) solution¹ developed by researchers at Microsoft and OpenAI. μTransfer aims to address the unique challenge of performing HPT on particularly large models, on the order of billions of parameters. Performing HPT on such models can be prohibitively expensive given the amount of machinery and time required for each individual trial. The study shows that when adopting the maximal update parametrization (μP), many of the hyperparameters that are optimal on smaller versions of the architecture remain optimal on larger versions. Thus, the strategy proposed is to perform traditional HPT on a small version of the model parameterized in μP and then transfer the learned hyperparameters to the large model.
In this post we demonstrate the first step in this process on a vision transformer (ViT) based classification model: we tune the learning rate for a relatively small ViT parameterized in μP using the dedicated utilities defined in the mup Python package. The code block below demonstrates how to parameterize a HuggingFace ViT model in μP. The code follows the guidelines in the mup code base as well as the pointers provided in this code repository, dedicated to μP of HuggingFace transformer models. The lines of code that are unique to μP are marked with "### muP" comments. The code uses version 1.0.0 of the mup Python package and version 4.23.1 of the transformers package. Note that we include the code for the sake of completeness; a full understanding of the code is not required for the post.
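def build_model():
    from typing import Optional, Tuple, Union
    import torch
    from torch import nn
    from transformers import ViTForImageClassification, ViTModel, ViTConfig
    from transformers.models.vit.modeling_vit import ViTSelfAttention
    from mup import (
        MuSGD, get_shapes, set_base_shapes,
        make_base_shapes, MuReadout, normal_
    )

    def mup_forward(
        self, hidden_states,
        head_mask: Optional[torch.Tensor] = None,
        output_attentions: bool = False
    ) -> Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor]]:
        mixed_query_layer = self.query(hidden_states)
        key_layer = self.transpose_for_scores(self.key(hidden_states))
        value_layer = self.transpose_for_scores(self.value(hidden_states))
        query_layer = self.transpose_for_scores(mixed_query_layer)
        attention_scores = torch.matmul(query_layer,
                                        key_layer.transpose(-1, -2))
        ### muP: 1/d attention
        attention_scores = attention_scores / self.attention_head_size
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)
        attention_probs = self.dropout(attention_probs)
        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask
        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_shape)
        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
        return outputs

    # override forward function with mup_forward
    ViTSelfAttention.forward = mup_forward

    class MupViTForImageClassification(ViTForImageClassification):
        def __init__(self, config: ViTConfig) -> None:
            # initialize the pretrained-model base class, skipping the parent constructor
            super(ViTForImageClassification, self).__init__(config)
            self.num_labels = config.num_labels
            self.vit = ViTModel(config, add_pooling_layer=False)
            ### muP: Classifier head - replace nn.Linear with MuReadout
            if config.num_labels > 0:
                self.classifier = MuReadout(config.hidden_size, config.num_labels)
            else:
                self.classifier = nn.Identity()
            # Initialize weights and apply final processing
            self.post_init()

        def _init_weights(
            self,
            module: Union[nn.Linear, nn.Conv2d, nn.LayerNorm],
            readout_zero_init: bool = False,
            query_zero_init: bool = False) -> None:
            """Initialize the weights"""
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                ### muP: swap std normal init with normal_ from mup.init
                if isinstance(module, MuReadout) and readout_zero_init:
                    module.weight.data.zero_()
                else:
                    if hasattr(module.weight, 'infshape'):
                        normal_(module.weight, mean=0.0, std=self.config.initializer_range)
                    else:
                        module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
                ### End muP
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.LayerNorm):
                module.bias.data.zero_()
                module.weight.data.fill_(1.0)
            ### muP
            if isinstance(module, ViTSelfAttention) and query_zero_init:
                module.query.weight.data[:] = 0

    base_config = ViTConfig(
        intermediate_size=4 * 768,
    )
    delta_config = ViTConfig(
        intermediate_size=2 * 768,
        hidden_size=4 * 2 * 768,
    )
    base_model = MupViTForImageClassification(config=base_config)
    delta_model = MupViTForImageClassification(config=delta_config)
    base_shapes = make_base_shapes(base_model, delta_model)
    model = MupViTForImageClassification(ViTConfig(num_labels=3))
    # register the base shapes on the target model, then re-initialize its weights
    set_base_shapes(model, base_shapes)
    model.apply(model._init_weights)
    return model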
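The "1/d attention" comment in the code marks the one change made to the attention computation itself. As a sketch, let $Q$, $K$ and $V$ denote the per-head query, key and value matrices and $d$ the head dimension (`attention_head_size` in the code). The standard scaled dot-product attention and the μP variant used here then differ only in the scaling factor:

```latex
\begin{align}
\text{standard:}\quad & \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V \\
\text{$\mu$P:}\quad & \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{d}\right) V
\end{align}
```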
The optimizer in the training function should be replaced by:
# 4. define optimizer with the configured lr value
from mup.optim import MuAdamW
optim = MuAdamW(model.parameters(), lr=config['lr'])