Distributed Training On Multiple GPUs




With the continuous growth of data scale and model size, training takes more and more time, especially on a single GPU. It is therefore necessary to use multiple GPUs to speed up the training process.

In this blog post, I will first introduce the theoretical basis of distributed training, and then go through two ways to implement it on multiple GPUs with PyTorch: nn.DataParallel and nn.DistributedDataParallel.

Theoretical Basis

Training Loop

First, let’s go over how training a neural network usually works. Each iteration of the training loop consists of four main steps:

[Figure: the four steps of the training loop. Images created by HuggingFace]
  1. The forward pass, where the input is processed by the neural network;
  2. The loss is calculated, comparing the predicted label with the ground-truth label;
  3. The backward pass, where the gradient of the loss with respect to each parameter is computed via back-propagation (loss.backward());
  4. The parameters are updated using the gradients (optimizer.step()).
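
Putting these four steps together, a minimal single-GPU training loop in PyTorch might look like the sketch below. The toy dataset, model, and hyper-parameters are placeholders of my own for illustration, not from the original post:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data and model, purely for illustration
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
data_loader = DataLoader(dataset, batch_size=32, shuffle=True)
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for inputs, labels in data_loader:
    outputs = model(inputs)            # 1) forward pass
    loss = criterion(outputs, labels)  # 2) loss computation
    optimizer.zero_grad()
    loss.backward()                    # 3) backward pass (gradients)
    optimizer.step()                   # 4) parameter update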

Model Parallelism & Data Parallelism

  • Model Parallelism: different GPUs input the same data and run different parts of the model, such as different layers of a multi-layer network;
  • Data Parallelism: different GPUs input different data and run the same complete model.

When the model is so large that it no longer fits on a single GPU, Model Parallelism can be used to assign different parts of the model to different devices, but this introduces a lot of communication overhead. Moreover, the different parts of the model depend on each other, so it scales poorly. Therefore, when the whole model fits on one GPU, Data Parallelism is usually adopted: each replica works independently, which gives good scalability.
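
As a rough illustration of model parallelism (my own minimal sketch, not code from the post, and assuming two GPUs are available), the two halves of a network can be pinned to different devices, with activations moved between them in forward():

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Naive model parallelism: first half on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(10, 64), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(64, 2).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Moving activations between devices is exactly the communication
        # overhead mentioned above
        return self.part2(x.to("cuda:1"))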

Synchronous Update & Asynchronous Update

For data parallelism, since each GPU is responsible for part of the data, it involves the issue of updating parameters, which can be divided into two methods: synchronous update and asynchronous update.

  • Synchronous Update: after all GPUs have finished their computation for the current batch, the new weights are computed once from the combined gradients, all GPUs synchronize to the new values, and only then does the next round of computation start.
  • Asynchronous Update: after a GPU has computed its gradients, it does not wait for the others; the shared weights are immediately updated and synchronized.

Synchronous updates involve waiting, and the speed is determined by the slowest GPU; asynchronous updates have no waiting time, but with more complex models the stale gradients can cause the loss to degrade and jitter heavily. Therefore, in practice, synchronous updates are generally used.
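
To make the synchronous scheme concrete, here is a minimal sketch of my own (assuming a process group has already been initialized and each process has computed a loss on its own mini-batch) of how gradients could be averaged across GPUs with an all-reduce before the optimizer step. This is essentially what DDP does for you automatically:

import torch.distributed as dist

def synchronous_step(model, optimizer, loss, world_size):
    # Backward pass computes local gradients on this GPU's mini-batch
    optimizer.zero_grad()
    loss.backward()
    # Average gradients across all GPUs so every replica applies the same update
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
    optimizer.step()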

Parameter Server & Ring AllReduce

Suppose we have N GPUs:

  • Parameter Server: GPU 0 (acting as the Reducer) divides the data into N parts and distributes them to the GPUs. Each GPU trains on its own mini-batch; after computing its gradients, it sends them back to GPU 0, which accumulates them, computes the updated weight parameters, and distributes them back to every GPU.
  • Ring AllReduce: the N GPUs are connected in a ring. Each GPU has a left-hand channel and a right-hand channel, one responsible for receiving and the other for sending. Gradient accumulation is completed in (N-1) passes (Scatter Reduce), and the parameters are then synchronized in another (N-1) passes (All Gather).

Parameter Server has two disadvantages:

  1. Each training iteration requires all GPUs to synchronize and perform a Reduce before it can finish. When many GPUs run in parallel, the straggler (barrel) effect becomes severe and computational efficiency drops.
  2. All GPUs exchange data, gradients, and parameters with the Reducer. When the model or the data is large, this communication overhead is high.

Assuming there are N GPUs and the time required to communicate a complete set of parameters is K, the communication cost of the PS architecture is T = 2(N-1)K: the Reducer has to receive gradients from, and send updated parameters to, each of the other N-1 GPUs.

The biggest problem with the Parameter Server is that its communication cost grows linearly with the number of GPUs, whereas the communication cost of Ring AllReduce is almost independent of the number of GPUs. Ring AllReduce is divided into two steps: Scatter Reduce and All Gather.

  • Scatter Reduce: first, the parameters are divided into N chunks, and adjacent GPUs pass different chunks around the ring. After N-1 passes, each GPU holds the fully accumulated value of one chunk of the parameters.
[Figure: Scatter Reduce]
  • All Gather: once each chunk has been accumulated, another round of N-1 passes synchronizes the accumulated chunks to all GPUs.
[Figure: All Gather]

According to these two passes, the communication cost of Ring AllReduce is T = 2(N-1)K/N: in each of the 2(N-1) steps, every GPU sends only a chunk of size K/N.
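
As a quick sanity check of the two formulas (a worked example of my own, not from the original post), here is the cost for a few GPU counts with K = 1:

# Compare the two communication-cost formulas with K = 1
K = 1
for N in (2, 4, 8, 64):
    ps_cost = 2 * (N - 1) * K          # Parameter Server: grows linearly with N
    ring_cost = 2 * (N - 1) * K / N    # Ring AllReduce: approaches 2K as N grows
    print(f"N={N:3d}  PS={ps_cost:6.2f}  Ring={ring_cost:.2f}")

For N = 64, the Parameter Server cost is 126K while the Ring AllReduce cost is still below 2K, which is why Ring AllReduce scales so much better.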

Examples with PyTorch

  • DataParallel (DP): Parameter Server mode; one GPU acts as the Reducer. The implementation is also super simple, just one line of code.
  • DistributedDataParallel (DDP): All-Reduce mode; originally intended for multi-machine distributed training, but it can also be used on a single machine with multiple GPUs.

DataParallel

import torch
import torch.nn as nn

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # only one line of code
    model = nn.DataParallel(model)

Run Command:

python train.py
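
One practical note of my own (not in the original post): nn.DataParallel wraps the original module, so its weights are reached through model.module. A hedged end-to-end sketch, with a placeholder model, might look like this:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)            # placeholder model for illustration
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the model and splits each batch
model = model.to("cuda")            # parameters live on the default GPU (cuda:0)

inputs = torch.randn(32, 10).to("cuda")
outputs = model(inputs)             # the batch is scattered across the visible GPUs

# The wrapped module is reachable via .module, e.g. for saving weights
state_dict = (model.module if isinstance(model, nn.DataParallel) else model).state_dict()
torch.save(state_dict, "model.pt")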

DistributedDataParallel

'''Only five steps'''
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# 1) Initialize the communication backend
torch.distributed.init_process_group(backend="nccl")

# 2) Configure the GPU of each process
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

# 3) Use DistributedSampler to distribute the data to each GPU
sampler = DistributedSampler(dataset)
data_loader = DataLoader(dataset=dataset, batch_size=batch_size, sampler=sampler)

# 4) Move the model to each GPU
model.to(device)

# 5) Wrap the model
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank],
                                                      output_device=local_rank)

Run Command:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py
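
One detail worth adding (my own note, not from the original post): with a DistributedSampler, the sampler should be told the current epoch so that shuffling differs from epoch to epoch. A minimal per-epoch loop might look like the sketch below, where num_epochs, criterion, and optimizer are assumed placeholders defined elsewhere. Newer PyTorch releases also recommend torchrun in place of torch.distributed.launch.

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle the data differently on every epoch
    for inputs, labels in data_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()       # DDP averages gradients across processes here
        optimizer.step()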

References

https://towardsdatascience.com/how-to-scale-training-on-multiple-gpus-dae1041f49d2

https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html?highlight=dataparallel#torch.nn.DataParallel

https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html#distributeddataparallel
