I guess the problem is obvious and you have probably experienced it yourself. You want to train a deep learning model and take advantage of multiple GPUs, a TPU or even multiple workers for extra speed or a larger batch size. But of course you cannot (let's say should not, because I've seen it quite often 😅) block the usually shared hardware for debugging, or spend a ton of money on a paid cloud instance.
Let me tell you, it is not important how many physical GPUs your system has, but rather how many your software thinks it has. The keyword is: (device) virtualization.
Let’s implement it
First, let's have a look at how you would usually detect and connect to your GPU:
You would first list all available devices, then select a suitable strategy, and then initialize your model, optimizer and checkpoint within the scope of the strategy. If you use a standard training loop with model.fit(), you would be done. If you use a custom training loop, you would need to implement some extra steps.
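The steps above can be sketched roughly like this (a minimal sketch; the tiny Dense model and the Adam optimizer are just placeholders for your own setup):

```python
import tensorflow as tf

# List all GPUs visible to TensorFlow.
gpus = tf.config.list_logical_devices("GPU")
print("Logical GPUs:", gpus)

# Select a suitable strategy for the available devices.
# On a machine without GPUs we fall back to the strategy's defaults.
devices = [gpu.name for gpu in gpus] or None
strategy = tf.distribute.MirroredStrategy(devices=devices)

# Model, optimizer and checkpoint are created inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential(
        [tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(10)]
    )
    optimizer = tf.keras.optimizers.Adam()
    checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
```

With model.fit() you would now simply call it as usual; the strategy handles the replication.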
Check out my tutorial on Accelerated Distributed Training with TensorFlow on Google's TPU for more details on distributed training with custom training loops.
There is one important detail in the code above. Did you notice I used the function list_logical_devices("GPU") rather than list_physical_devices("GPU")? Logical devices are all devices visible to the software, but these are not always associated with an actual physical device. If we run the code block right now, this could be an output you would see:
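For example, on a machine with a single GPU and default settings, comparing the two lists might look like this (the device reprs below are illustrative of TensorFlow's usual format):

```python
import tensorflow as tf

# By default, each physical GPU is mapped to exactly one logical GPU.
print("Physical:", tf.config.list_physical_devices("GPU"))
print("Logical: ", tf.config.list_logical_devices("GPU"))
# On a single-GPU machine this typically prints something like:
#   Physical: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
#   Logical:  [LogicalDevice(name='/device:GPU:0', device_type='GPU')]
```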
We will use the logical device definition to our advantage and define some logical devices before we list all logical devices and connect to them. To be precise, we will define 4 logical GPUs associated with a single physical GPU. This is how it is done:
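A minimal sketch (the 1024 MB per logical device is an arbitrary example value; it has to fit into your real GPU's memory four times):

```python
import tensorflow as tf

physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    # Split the first physical GPU into 4 logical GPUs.
    # memory_limit is given in MB per logical device.
    tf.config.set_logical_device_configuration(
        physical_gpus[0],
        [
            tf.config.LogicalDeviceConfiguration(memory_limit=1024)
            for _ in range(4)
        ],
    )
```

This must run before the first operation touches the GPU, i.e. before TensorFlow's runtime is initialized.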
If we again print the number of logical vs. physical devices, you'll see:
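Something like this (with the 4-way split from above applied on a single-GPU machine):

```python
import tensorflow as tf

# After the logical device configuration has been set,
# one physical GPU shows up as several logical GPUs.
print("Physical GPUs:", len(tf.config.list_physical_devices("GPU")))
print("Logical GPUs: ", len(tf.config.list_logical_devices("GPU")))
# On a single GPU split into 4 logical devices this should report
# 1 physical GPU and 4 logical GPUs.
```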
And voilà, you can now test your code on a single GPU as if you were performing distributed training on 4 GPUs.
There are several things to keep in mind:
- You are not actually performing distributed training, hence there is no performance gain through parallelization.
- You need to assign the logical devices before you connect to your hardware; otherwise an exception is raised.
- It only tests the correct implementation of your algorithm, and you can check whether the output shapes and values are as expected. It will not guarantee that all drivers and hardware in a multi-GPU setup are configured correctly.
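The second point can be demonstrated directly. Once the runtime is initialized (for example by listing logical devices or creating a tensor), the device layout is frozen and reconfiguring it raises a RuntimeError. The sketch below uses the CPU device so it runs on any machine:

```python
import tensorflow as tf

cpus = tf.config.list_physical_devices("CPU")

# Listing logical devices initializes the runtime and
# freezes the device configuration.
tf.config.list_logical_devices()

try:
    tf.config.set_logical_device_configuration(
        cpus[0], [tf.config.LogicalDeviceConfiguration()] * 2
    )
except RuntimeError as err:
    print("Too late:", err)
```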
Let me know in the comments if this trick is useful for you and if you already knew about this feature! For me it was a game changer.