How to Configure a GPU Cluster to Scale with PyTorch Lightning (Part 2)


Scale from the local machine to the cloud. Photo by Soumil Kumar from Pexels

In part 1 of this series, we learned how PyTorch Lightning enables distributed training through organized, boilerplate-free, and hardware agnostic code.

In case you are wondering, the “trainer.x” syntax comes from our LightningCLI which enables you to add a full command-line interface to the Trainer and LightningModule with just one line of code.
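For example, with a LightningCLI-based script (here a hypothetical train.py; the flag values are illustrative, not from the original post), Trainer options can be set straight from the shell:

```shell
# Hypothetical LightningCLI entry point: every Trainer argument becomes a flag.
# Depending on your Lightning version, GPU selection may instead be --trainer.gpus.
python train.py fit \
    --trainer.accelerator=gpu \
    --trainer.devices=8 \
    --trainer.num_nodes=2
```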

In this post, we will learn how to configure a cluster so that Lightning can scale to multiple GPU machines, with a simple, ready-to-run PyTorch Lightning ImageNet example.

Thanks to Lightning, you do not need to change this code to scale from one machine to a multi-node cluster.

While Lightning supports many cluster environments out of the box, this post addresses the case in which scaling your code requires local cluster configuration.

Note: If you don’t want to manage cluster configuration yourself and just want to focus on training, you can check out the early access feature that enables you to scale multi-node training with no code changes and no cluster configuration.

Cluster Configuration for Distributed Training with PyTorch Lightning

Photo by Brett Sayles from Pexels

Before we can launch experiments on a multi-node cluster, we need to know which type of cluster we are working with.

This tutorial will address the two general types of clusters:

  • Managed Clusters such as SLURM enable users to request resources and launch processes through a job scheduler.
  • General-purpose Clusters provide users with direct access to all nodes on the same network. Processes are launched by logging into each node and starting each process manually.

Configuring Managed SLURM Cluster for Lightning

SLURM is typically found on clusters with many users, where scheduling of jobs and resources is crucial for the efficient operation of the cluster. It provides:

  • Queuing systems for job scheduling
  • Hardware resource allocations for jobs
  • Fair distribution of resources among users and user groups

For the managed cluster we will look at SLURM in particular since it is the most popular workload manager for large-scale clusters. If your institution maintains a SLURM cluster then chances are you have already run some jobs there.

Running Jobs on SLURM

This section assumes your system administrator has already configured your cluster and installed the SLURM workload manager. If this is not the case, check out the documentation here.

Before configuring SLURM for Lightning, we first need to confirm we can run multi-node jobs.
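The test script from the original post isn't reproduced in this copy; a minimal sketch that requests two nodes and prints each node's hostname might look like this (the resource numbers are assumptions):

```shell
#!/bin/bash
#SBATCH --nodes=2               # request two nodes
#SBATCH --ntasks-per-node=1     # one task per node
#SBATCH --time=00:05:00         # a short time limit for a smoke test

# srun runs the command once per task, so we get one hostname per node.
srun hostname
```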

Save these contents to a file called check.slurm and run it with the command sbatch check.slurm.

The job gets queued and the job ID gets printed to the terminal. We can monitor the status of jobs with the squeue command. Our test job should complete quickly and then disappear from the queue. The output of our script, containing the hostnames of the nodes that ran it, gets written to a file in the current working directory.

If this command does not work, it means that either your cluster does not have enough GPU resources or you do not have sufficient privileges to run jobs.

Submitting our Lightning Training Job

Here is the full SLURM batch script that runs our Lightning ImageNet training on two nodes using eight GPUs each:
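The batch script itself isn't reproduced in this copy. A sketch matching the directives described below, assuming the Lightning ImageNet script is saved as imagenet_train.py (the script name and flags are placeholders), might look like:

```shell
#!/bin/bash
#SBATCH --nodes=2               # two machines
#SBATCH --gres=gpu:8            # eight GPUs on each machine
#SBATCH --ntasks-per-node=1     # one task per node; Lightning launches the per-GPU processes
#SBATCH --time=02:00:00         # terminate the job after two hours

# Placeholder script name and arguments; adjust to your environment.
# Depending on your Lightning version, GPU selection may instead be --trainer.gpus.
python imagenet_train.py \
    --trainer.accelerator=gpu \
    --trainer.devices=8 \
    --trainer.num_nodes=2
```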

There are four steps to submitting a Lightning training script with SLURM:

  1. Prepare your Lightning script as you normally would in a file.
  2. Prepare a submit.slurm batch script which contains instructions for SLURM on how to deploy your job.
  3. Submit the job by running sbatch submit.slurm
  4. Monitor the job and wait for it to complete

To get started save the submission batch script to a file called submit.slurm.

The first four lines of the submission script contain the following SBATCH directives, which tell the SLURM workload manager how to deploy the job on the cluster.

  • #SBATCH --nodes=2 Requests two nodes from the cluster.
  • #SBATCH --gres=gpu:8 Requests eight GPUs on each node.
  • #SBATCH --ntasks-per-node=1 The number of tasks to run on each server. Important: This is set to 1 and does not correspond to the number of processes/GPUs per node, because launching the individual processes is done by Lightning on each node separately.
  • #SBATCH --time=02:00:00 The maximum time we expect the job to run, in hh:mm:ss format. The job will be terminated once the time limit is reached.

Note: The # symbol in front of the SBATCH directives is required and should not be removed: although such lines look like shell comments, SLURM parses #SBATCH lines as directives.

Now, we can submit the job using the command:

sbatch submit.slurm

This is all we have to do in order to run on SLURM, but there are additional SLURM options you can turn on; check out the useful sample scripts on this Wiki page.

Configuring a General Purpose Cluster for Lightning

If SLURM is not available, you can still configure multi-node experiments with two or more GPU servers on the same network, as long as:

  1. All server nodes have known IP addresses. In this example, we assume we have two servers with known IPs and select the first server as the head, or PyTorch master, node.
  2. The PyTorch master node needs an open port for incoming and outgoing TCP/IP traffic. This can be configured in the firewall of your operating system. In this example, we assume the open port number is 1234. You may have to ask the administrator of your server to do it for you.
  3. Code is accessible on each node through a shared filesystem

Once these conditions are met, log into each node on the network, starting with the PyTorch master node, and manually launch the training script on each one, passing it its node rank and the PyTorch master node's address and port.
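Concretely, the manual launch might look like the following, with Lightning picking up the usual PyTorch environment variables (the IP address, script name, and flags are placeholders; port 1234 is the open port from step 2):

```shell
# On the master node (node rank 0); 10.10.10.1 is a placeholder IP.
export MASTER_ADDR=10.10.10.1
export MASTER_PORT=1234
export NODE_RANK=0
python imagenet_train.py --trainer.devices=8 --trainer.num_nodes=2

# On the second node, only the node rank changes:
export MASTER_ADDR=10.10.10.1
export MASTER_PORT=1234
export NODE_RANK=1
python imagenet_train.py --trainer.devices=8 --trainer.num_nodes=2
```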

