How to Run Machine Learning Hyperparameter Optimization in the Cloud — Part 2

Photo by Davide Cantelli on Unsplash

This is the second part of a three-part post on the topic of hyperparameter tuning (HPT) of machine learning models in the cloud. In part 1 we set the stage by introducing the problem and defining a toy model that we will use in our tuning demonstrations. In this part we will review two options for cloud-based optimization, both of which involve parallel experimentation on a dedicated tuning cluster.

Option 1: HPT on a Cluster of Cloud Instances

The first option we consider for performing HPT in the cloud is based on a cluster of cloud instances. There are literally dozens of different ways to set up an instance cluster. For example, to create a cluster on AWS you could: 1) simply launch the number of Amazon EC2 instances you desire via the EC2 console, 2) build and manage a cluster using a container orchestration framework such as Kubernetes, 3) use Amazon’s managed Kubernetes service, Amazon EKS, etc.

The primary advantage of this HPT option is the flexibility it provides. When you launch your own cluster, you are pretty much free to set it up any way you want. This implies the freedom to perform HPT in any way you want — using any framework, any algorithm, any feature (such as multiple trials per instance), any auto-scaling mechanism, etc. In particular, you can set up your cluster so that the head node runs on a reliable non-spot (non-preemptible) instance while all of the worker nodes run on spot instances to reduce cost.
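
As a sketch of that spot/non-spot split, assuming Ray's AWS cluster launcher (which we use below) and the same instance type as our demo, the worker nodes' node_config can request the spot market while the head node keeps the on-demand default (node_config contents are passed through to the EC2 RunInstances API):

```yaml
# Sketch only: head node on-demand, worker nodes on spot.
available_node_types:
  head_node:
    node_config: {InstanceType: g4dn.xlarge}    # on-demand by default
  worker_nodes:
    node_config:
      InstanceType: g4dn.xlarge
      InstanceMarketOptions: {MarketType: spot} # preemptible, discounted
head_node_type: head_node
```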

The price of this flexibility and freedom is the effort required to set up and maintain this option. Launching, configuring, and managing clusters requires a certain degree of expertise. Organizations that choose this path will usually have a (DevOps) team fully dedicated to cluster management. Some organizations may have guidelines in place (e.g., for security reasons) that you will need to comply with. This may further complicate the use of this type of solution or introduce limitations on its aforementioned freedoms.

In this post we will demonstrate cluster creation using the Ray framework. Ray includes built-in support for launching a cluster on AWS. To launch a cluster for HPT we placed the following cluster configuration in a tune.yaml YAML file.

cluster_name: hpt-cluster
provider: {type: aws, region: us-east-1}
auth: {ssh_user: ubuntu}
min_workers: 0
max_workers: 7
available_node_types:
  head_node:
    node_config: {InstanceType: g4dn.xlarge,
                  ImageId: ami-093e10b196d7cc7f0}
  worker_nodes:
    node_config: {InstanceType: g4dn.xlarge,
                  ImageId: ami-093e10b196d7cc7f0}
head_node_type: head_node
setup_commands:
  - echo 'export PATH="$HOME/anaconda3/envs/pytorch_p39/bin:$PATH"' >> ~/.bashrc
  - conda activate pytorch_p39 &&
    pip install "ray[tune]" "ray[air]" &&
    pip install mup transformers evaluate datasets

This YAML file defines an HPT environment with up to eight Amazon EC2 g4dn.xlarge instances, each with version 65.3 of the AWS Deep Learning AMI, preconfigured to use a dedicated PyTorch conda environment, and with all Python dependencies preinstalled. (For simplicity, we have omitted the Python package versions from the script.)

The command for launching this cluster is:

ray up tune.yaml -y

Assuming all AWS account settings are configured appropriately, this will create the cluster head node, to which we will connect when running our HPT. In the code block below, we demonstrate how we use the Ray Tune library to run HPT. Here we chose the ASHA scheduling algorithm with random parameter search. The algorithm is configured to run a total of 32 experiments, each for a maximum of eight epochs, and each with a different candidate for the optimizer learning rate. Up to eight parallel experiments can run at a time. The experiments are ranked by the reported evaluation accuracy, and underperforming experiments are terminated early according to ASHA's early-stopping rule.

def hpt():
    from ray import tune

    # define search space
    config = {
        "lr": tune.loguniform(1e-6, 1e-1),
    }

    # define algorithm
    from ray.tune.schedulers import ASHAScheduler
    scheduler = ASHAScheduler(
        max_t=8,
        grace_period=1,
        reduction_factor=2,
        metric="accuracy",
        mode="max")

    import torch
    gpus_per_trial = 1 if torch.cuda.is_available() else 0
    tuner = tune.Tuner(
        tune.with_resources(
            tune.with_parameters(train),  # our train function from part 1
            resources={"cpu": 4, "gpu": gpus_per_trial}),
        tune_config=tune.TuneConfig(num_samples=32,
                                    max_concurrent_trials=8,
                                    scheduler=scheduler,
                                    ),
        param_space=config,
    )

    # tune
    results = tuner.fit()
    best_result = results.get_best_result("accuracy", "max")
    print("Best trial config: {}".format(best_result.config))
    print("Best final validation accuracy: {}".format(
        best_result.metrics["accuracy"]))

if __name__ == "__main__":
    import ray
    ray.init()
    hpt()

The final ingredient is the session reporter. For Ray Tune to be able to track the progress of the experiments, we add the following callback to the list of HuggingFace Trainer callbacks:

from transformers import TrainerCallback

class RayReport(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics, **kwargs):
        from ray.air import session
        session.report({"loss": metrics['eval_loss'],
                        "accuracy": metrics['eval_accuracy']})

The following command will trigger the HPT job:

ray submit tune.yaml train.py

To support the demand for eight parallel experiments, the cluster auto-scaler will start up seven additional (worker) nodes. The HPT job will run a total of 32 experiments, some of which may be stopped early according to ASHA’s early stopping algorithm.
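To get a feel for how many experiments survive each stage, the following sketch computes the rung sizes of an idealized, synchronous successive-halving run with our settings (grace_period=1, max_t=8, reduction_factor=2). Note this is a simplification of our own making: ASHA is asynchronous and promotes trials eagerly, so actual counts may differ.

```python
# Idealized successive-halving rungs for the ASHA settings used above.
def asha_rungs(num_trials, grace_period=1, max_t=8, reduction_factor=2):
    rungs = []
    t = grace_period          # epochs completed at this rung
    survivors = num_trials    # trials still running at this rung
    while t <= max_t:
        rungs.append((t, survivors))
        survivors //= reduction_factor  # keep the top 1/reduction_factor
        t *= reduction_factor
    return rungs

print(asha_rungs(32))  # → [(1, 32), (2, 16), (4, 8), (8, 4)]
```

In other words, roughly half of the 32 trials are stopped after a single epoch, and only about four run the full eight epochs.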

The following command will stop the cluster (though it will not terminate the EC2 instances):

ray down tune.yaml -y

Please do not be fooled by the relative simplicity of the flow we have described here. In practice, as mentioned above, appropriately configuring your cloud environment, appropriately configuring your YAML file, and appropriately managing your resultant cluster can be quite tricky. The remaining cloud-based HPT solutions we will review bypass these complexities by using a high-level cloud training service to manage the compute instances.

Results

Our HPT job ran for roughly 30 minutes and produced the following results:

Total run time: 1551.68 seconds (1551.47 seconds for the tuning loop).
Best trial config: {'lr': 2.393250830770165e-05}
Best final validation accuracy: 0.7669172932330827

Option 2: HPT Inside a Managed Training Environment

Specialized cloud training services, such as Amazon SageMaker, offer many conveniences for machine learning model development. In addition to simplifying the process of launching and managing training instances, they may include compelling features such as accelerated data input streaming, distributed training APIs, advanced monitoring tools, and more. These properties make managed training environments the solution of choice for many machine learning development teams. The challenge is how to extend these environments to support HPT. In this section we will introduce the first of three proposed solutions for this. The first method will demonstrate how to run a Ray Tune HPT job within an Amazon SageMaker training environment. Contrary to the previous method, in which we explicitly defined and launched the instance cluster, here we will rely on Amazon SageMaker to do this. While the previous method included an auto-scaler for scaling the cluster up and down based on the HPT scheduling algorithm, in this method we create a fixed-size instance cluster. We will use the same Ray Tune based solution and the same hpt() function that we defined above and modify the entry point to set up a Ray cluster on the EC2 cluster that was launched by the managed service:

if __name__ == "__main__":
    import json, os

    # utility for identifying the rank of this node
    def get_node_rank() -> int:
        cluster_inf = json.loads(os.environ.get('SM_RESOURCE_CONFIG'))
        return cluster_inf['hosts'].index(cluster_inf['current_host'])

    # utility for finding the hostname of the head node
    def get_master() -> str:
        cluster_inf = json.loads(os.environ.get('SM_RESOURCE_CONFIG'))
        return cluster_inf['hosts'][0]

    if get_node_rank() == 0:
        # the head node starts a Ray cluster and runs the HPT
        import subprocess
        subprocess.Popen('ray start --head --port=6379',
                         shell=True).wait()
        import ray
        ray.init(address='auto')  # attach to the cluster we just started
        hpt()
    else:
        # worker nodes attach to the head node's cluster
        import subprocess
        import time
        subprocess.Popen(
            f"ray start --address='{get_master()}:6379'",
            shell=True).wait()
        import ray
        ray.init(address='auto')
        try:
            # keep node alive until the HPT process on the head node completes
            while ray.is_initialized():
                time.sleep(10)
        except Exception:
            pass
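The SM_RESOURCE_CONFIG environment variable parsed by the utilities above is a JSON document that includes (among other fields) the list of hosts in the job and the current host's name. A minimal sketch of the rank and head-node lookup, using an illustrative payload (real SageMaker jobs set this variable automatically):

```python
import json, os

# Illustrative payload mimicking a two-node SageMaker training job.
os.environ['SM_RESOURCE_CONFIG'] = json.dumps({
    "current_host": "algo-2",
    "hosts": ["algo-1", "algo-2"],
})

cluster_inf = json.loads(os.environ['SM_RESOURCE_CONFIG'])
rank = cluster_inf['hosts'].index(cluster_inf['current_host'])  # node rank
master = cluster_inf['hosts'][0]                                # head hostname
print(rank, master)  # → 1 algo-1
```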

The code block below demonstrates how to set up an Amazon SageMaker training job with the resources required for our HPT script. Rather than starting up eight single-GPU instances to support eight parallel experiments, we request two four-GPU instances in order to demonstrate our ability to run multiple (four) experiments per instance. This is not possible with either of the next two HPT methods we will discuss. Accordingly, we also raise the per-trial resource request to 12 CPUs (instead of just 4).

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',
    source_dir='./',  # contains train.py and requirements file
    role=<role>,
    instance_type='ml.g4dn.12xlarge',  # 4 gpus
    instance_count=2,
    py_version='py38',
    framework_version='1.12')
estimator.fit()

The source_dir should point to your local directory containing the train.py script and a requirements.txt file with all Python package dependencies:

ray[air]
ray[tune]
mup==1.0.0
transformers==4.23.1
datasets==2.6.1
evaluate==0.3.0
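
Since each ml.g4dn.12xlarge provides 48 vCPUs and 4 GPUs, the per-trial request of 12 CPUs and 1 GPU packs exactly four concurrent trials onto each instance. A quick sanity check of that packing arithmetic (the helper name here is ours, not part of the post's code):

```python
# How many concurrent Ray Tune trials fit on one instance, given the
# per-trial resource request. Ray packs trials by dividing instance
# resources by per-trial requests.
def trials_per_instance(instance_cpus, instance_gpus, per_trial):
    return min(instance_cpus // per_trial["cpu"],
               instance_gpus // per_trial["gpu"])

per_trial = {"cpu": 12, "gpu": 1}       # request used on ml.g4dn.12xlarge
print(trials_per_instance(48, 4, per_trial))  # → 4
```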

Pros and Cons

This method of tuning shares a lot of the advantages of the cluster based method from the previous section — we are pretty much free to run any HPT framework and any HPT algorithm that we desire. The main disadvantage of this approach is its lack of auto-scalability. The number of cluster instances needs to be determined upfront and remains constant throughout the duration of the training job. If we use an HPT algorithm in which the number of parallel experiments changes during the course of tuning, we may find some of our (expensive) resources lying idle for periods of time.

An additional limitation has to do with the ability to use discounted spot instances. While Amazon SageMaker supports recovery from spot interruptions, as of the time of this writing the spot configuration applies to all of the instances in the cluster. You do not have the option of configuring the head node to be persistent and only the worker nodes to be spot instances. Moreover, a spot interruption of just a single instance will trigger a restart of the entire cluster. While this does not completely prohibit the use of spot instances, it does make it much more complicated.

Results

The results of our HPT inside Amazon SageMaker run are summarized in the block below:

Total run time: 893.30 seconds (893.06 seconds for the tuning loop).
Best trial config: {'lr': 3.8933781751481333e-05}
Best final validation accuracy: 0.7894736842105263

Up Next

The final part of our three-part post will explore two additional methods for cloud-based HPT: using a managed HPT service and wrapping managed training jobs in an HPT solution.
