How to Configure a GPU Cluster to Scale with PyTorch Lightning (Part 2)
In this post, we will learn how to configure a cluster so that Lightning can scale to multiple GPU machines, using a simple, ready-to-run PyTorch Lightning ImageNet example.
Thanks to Lightning, you do not need to change this code to scale from one machine to a multi-node cluster.
While Lightning supports many cluster environments out of the box, this post addresses the case in which scaling your code requires local cluster configuration.
Note: If you don’t want to manage cluster configuration yourself and just want to focus on training, you can check out the early-access Grid.ai feature that lets you scale multi-node training with no code changes and no cluster configuration.
Cluster Configuration for Distributed Training with PyTorch Lightning
Before we can launch experiments in a multi-node cluster we need to be aware of the type of cluster we are working with.
This tutorial will address the two general types of clusters:
- Managed Clusters such as SLURM enable users to request resources and launch processes through a job scheduler.
- General-purpose Clusters provide users with direct access to all nodes on the same network. Processes are launched by logging into each node and starting each process manually.
Configuring a Managed SLURM Cluster for Lightning
SLURM is typically found on clusters with many users, where scheduling of jobs and resources is crucial for efficient operation. It provides:
- Queuing systems for job scheduling
- Hardware resource allocations for jobs
- Fair distribution of resources among users and user groups
For the managed cluster we will look at SLURM in particular since it is the most popular workload manager for large-scale clusters. If your institution maintains a SLURM cluster then chances are you have already run some jobs there.
Running Jobs on SLURM
Before configuring SLURM for Lightning, we first need to confirm we can run multi-node jobs.
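A minimal test, assuming sbatch and srun are available on your login node, is to submit a job that simply prints the hostname of every node it runs on (the resource flags mirror the ones we use later; the output filename is a placeholder):

```shell
# Ask SLURM for two nodes with 8 GPUs each and run `hostname` on every node.
# The output filename hostnames.out is a placeholder of our choosing.
sbatch --nodes=2 --gres=gpu:8 --ntasks-per-node=1 \
    --output=hostnames.out --wrap "srun hostname"
```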
The job gets queued and the job ID is printed to the terminal. We can monitor the status of jobs with the squeue command. Our test job should complete quickly and then disappear from the queue. The output of our script, containing the hostnames of the nodes it ran on, gets written to a file in the current working directory.
If this command does not work, it means that either your cluster does not have enough GPU resources or you do not have sufficient privileges to run jobs.
Submitting our Lightning Training Job
There are 4 steps to submitting a Lightning training script with SLURM.
- Prepare your Lightning training script as you normally would.
- Prepare a submit.slurm batch script which contains instructions for SLURM on how to deploy your job.
- Submit the job by running the sbatch command.
- Monitor the job and wait for it to complete.
To get started, save the submission batch script to a file called submit.slurm.
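A minimal submit.slurm might look like the following sketch; the training script name train.py and the environment path are placeholders for your own setup:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
#SBATCH --time=02:00:00

# Activate your Python environment (placeholder path).
source /path/to/venv/bin/activate

# srun starts one task per node; Lightning spawns the
# per-GPU processes on each node itself.
srun python train.py
```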
The first four lines of the submission script contain the following SBATCH directives to signal the SLURM workload manager how to deploy the job on the cluster.
- #SBATCH --nodes=2 Requests two nodes from the cluster.
- #SBATCH --gres=gpu:8 Requests servers that have 8 or more GPUs inside.
- #SBATCH --ntasks-per-node=1 The number of tasks to run on each server. Important: This is set to 1 and does not correspond to the number of processes/GPUs per node, because Lightning launches the individual processes on each node itself.
- #SBATCH --time=02:00:00 The maximum time we expect the job to run, in hh:mm:ss format. The job will terminate after the time limit is reached.
Note: The # symbol in front of the SBATCH directives does not indicate a comment; it is required and should not be removed.
Now, we can submit the job with the sbatch command.
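Assuming the batch script was saved as submit.slurm, submission and monitoring look like this:

```shell
# Queue the job; SLURM prints the assigned job ID.
sbatch submit.slurm

# Check its status while it is queued or running.
squeue -u $USER
```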
This is all we have to do to run on SLURM, but there are additional SLURM options you can enable. Check out the useful sample scripts on this Wiki page.
Configuring a General Purpose Cluster for Lightning
If SLURM is not available, you can still configure multi-node experiments with two or more GPU servers on the same network, as long as:
- All server nodes have known IP addresses. In this example, we assume we have two servers with IPs 10.10.10.1 and 10.10.10.2. We select the first server as the head or PyTorch master node.
- The PyTorch master node needs an open port for incoming and outgoing TCP/IP traffic. This can be configured in the firewall of your operating system. In this example, we assume the open port number is 1234. You may have to ask the administrator of your server to do it for you.
- Code is accessible on each node through a shared filesystem.
Once these conditions are met, log into each node on the network, starting with the PyTorch master node, and manually launch the training script on each one, passing it its node rank and the PyTorch master node's address and port.
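Lightning picks up the rendezvous information from environment variables. A sketch, assuming a train.py entry point and the example addresses from above:

```shell
# On the master node (10.10.10.1):
export MASTER_ADDR=10.10.10.1  # address of the PyTorch master node
export MASTER_PORT=1234        # the open port from our example
export NODE_RANK=0             # this node's rank; master is rank 0
python train.py

# On the second node (10.10.10.2), identical except for the rank:
export MASTER_ADDR=10.10.10.1
export MASTER_PORT=1234
export NODE_RANK=1
python train.py
```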