Implement Multi-GPU Training on a single GPU



Motivation

I guess the problem is obvious and you have probably experienced it yourself: you want to train a deep learning model and take advantage of multiple GPUs, a TPU, or even multiple workers for some extra speed or a larger batch size. But of course you cannot (let's say should not, because I've seen it quite often 😅) block the usually shared hardware for debugging, or spend a ton of money on a paid cloud instance.

Let me tell you: it is not important how many physical GPUs your system has, but rather how many your software thinks it has. The keyword is (device) virtualization.

Let’s implement it

First, let's have a look at how you would usually detect and connect to your GPU:

Code 1: Detect all available GPUs, select a suitable strategy, and initialize your model, optimizer and checkpoint within the scope of the strategy.

You would first list all available devices, then select a suitable strategy, and then initialize your model, optimizer and checkpoint within the scope of the strategy. If you use a standard training loop with model.fit(), you are done; if you use a custom training loop, you need to implement some extra steps.
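A minimal sketch of this pattern could look as follows; the tiny Dense model and the Adam optimizer are placeholder choices, not part of the original code:

```python
import tensorflow as tf

# List all GPUs visible to the software (logical, not physical devices).
gpus = tf.config.list_logical_devices("GPU")

# Select a suitable strategy: mirror across GPUs if there are several,
# otherwise fall back to the default single-device strategy.
if len(gpus) > 1:
    strategy = tf.distribute.MirroredStrategy([gpu.name for gpu in gpus])
else:
    strategy = tf.distribute.get_strategy()

# Model, optimizer and checkpoint must be created inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder model
    optimizer = tf.keras.optimizers.Adam()
    checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
```

On a machine without any GPU this simply runs under the default strategy, which is exactly why the trick described next is useful.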

Check out my tutorial on Accelerated Distributed Training with TensorFlow on Google’s TPU for more details on distributed training with custom training loops.

There is one important detail in the code above. Did you notice I used the function list_logical_devices("GPU") rather than list_physical_devices("GPU")? Logical devices are all devices visible to the software, but they are not always associated with an actual physical device. If we run the code block right now, this could be the output you would see:

Figure 1: Screenshot of output after running Code 1. and connecting to a single logical GPU with one associated physical GPU. Taken by author.

We will use the logical device definition to our advantage and define our own logical devices before we list and connect to them. To be precise, we will define 4 logical GPUs associated with a single physical GPU. This is how it is done:

Code 2: Create multiple logical GPU devices associated with a single physical GPU.
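A sketch of this step, assuming the TensorFlow 2.x configuration API; the memory limit of 1024 MB per logical device is an illustrative value you should adapt to your card:

```python
import tensorflow as tf

# Grab the physical GPUs before TensorFlow initializes them.
physical_gpus = tf.config.list_physical_devices("GPU")

if physical_gpus:
    # Split the first physical GPU into 4 logical GPUs, each with a
    # fixed memory limit (in MB). This must happen before the GPU is
    # initialized, otherwise TensorFlow raises a RuntimeError.
    tf.config.set_logical_device_configuration(
        physical_gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)
         for _ in range(4)],
    )

logical_gpus = tf.config.list_logical_devices("GPU")
print(len(physical_gpus), "physical GPU(s),", len(logical_gpus), "logical GPU(s)")
```

After this runs, Code 1 will detect four logical GPUs and pick the mirrored strategy accordingly.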

If we again print the number of logical vs. physical devices, you’ll see:

Figure 2: Screenshot of output after running Code 2 before Code 1 and connecting to four logical GPUs with one associated physical GPU. Taken by author.

And voilà, you can now test your code on a single GPU as if you were performing distributed training on 4 GPUs.

There are several things to keep in mind:

  1. You are not actually performing distributed training, hence there is no performance gain through parallelization.
  2. You need to define the logical devices before you connect to your hardware, otherwise an exception is raised.
  3. It only tests the correct implementation of your algorithm: you can check whether the output shapes and values are as expected. It does not guarantee that all drivers and hardware in a real multi-GPU setup are configured correctly.
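To check those output shapes and values, you can run a single step through the strategy and inspect the per-replica results. A sketch, where the reduce_sum stands in for a real training step and the code falls back to the default strategy when fewer than two logical GPUs exist:

```python
import tensorflow as tf

# Assumes the logical GPUs from Code 2 were already defined; without them
# this falls back to the default single-device strategy.
gpus = tf.config.list_logical_devices("GPU")
strategy = (tf.distribute.MirroredStrategy([g.name for g in gpus])
            if len(gpus) > 1 else tf.distribute.get_strategy())

@tf.function
def step(x):
    # Stand-in for a real training step.
    return tf.reduce_sum(x, axis=-1)

# Distribute a toy dataset so each replica receives its own slice of a batch.
dataset = tf.data.Dataset.from_tensor_slices(tf.ones((8, 4))).batch(8)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

for batch in dist_dataset:
    per_replica = strategy.run(step, args=(batch,))
    # With 4 replicas this is a PerReplica object holding 4 tensors;
    # with a single replica it is a plain tensor.
    print(strategy.num_replicas_in_sync, per_replica)
```

If the per-replica shapes or values are not what you expect, you have found a bug in your distributed logic without ever touching a multi-GPU machine.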

Let me know in the comments if this trick is useful for you and if you already knew about this feature! For me it was a game changer.

Happy testing!💪


