Training node — The smallest entity of scale
Chips consist of smaller sets of elements that are replicated across the die. In Dojo, these smaller sets are the training nodes. Each one holds the parts needed to carry out the bulk of the computation: the arithmetic logic unit, the control unit, SRAM memory, and other components.
Ganesh Venkataramanan, Director of Project Dojo, called the training node "the smallest entity of scale." It's the smallest component that's scaled further by placing exact copies in every direction. In particular, 354 connected training nodes make a chip, 25 connected chips make a training tile, 12 training tiles make a cabinet, and 10 cabinets make the ExaPOD. By scaling these elements all the way up from the training nodes, it's possible to reach computing performance in the exaFLOP range, but some limitations need to be solved to achieve such a feat.
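Multiplying those counts out reproduces the headline figures (a quick sketch; the constants are just the numbers quoted above):

```python
# Dojo scaling hierarchy, using the counts quoted in the presentation.
NODES_PER_CHIP = 354
CHIPS_PER_TILE = 25
TILES_PER_CABINET = 12
CABINETS_PER_EXAPOD = 10

chips = CHIPS_PER_TILE * TILES_PER_CABINET * CABINETS_PER_EXAPOD
nodes = NODES_PER_CHIP * chips

print(chips)  # 3000 D1 chips per ExaPOD
print(nodes)  # 1062000 training nodes, the "1M+" figure
```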
In particular, there's the question of what size to make the training node. Too small, and it's fast but too costly to synchronize; too big, and it's difficult to implement and can produce memory bottlenecks. Because they wanted to keep latency low, they measured the farthest distance a high-clock-speed signal (2+ GHz) can traverse in one cycle (the lowest possible latency) and drew a box around it to define the size of the node. And because they wanted to keep bandwidth high, they filled the box with wires "to the brink."
Then they completed the high-performance node with the computing elements, the memory pool, and a programmable control core. This combination of features gives 1024 GFLOPs of compute at BF16 per node, which drops to 64 GFLOPs at FP32 (the single-precision format more commonly used in performance tests). Finally, what makes these training nodes capable of scaling without degrading performance is that they're designed to be highly modular. That is, they're connected in such a way that computing capability stays constant and together they form a high-throughput communication plane.
D1 Chip — Comparable to the best GPUs out there
Putting together 354 training nodes results in 22.6 TFLOPs of compute at FP32 (for comparison, the Nvidia A100 delivers 19.5 TFLOPs) and an on-chip bandwidth of 10 TBps in each direction. Around the set of nodes, they placed an array of high-speed, low-power lanes to get an off-chip I/O bandwidth of 4 TBps per edge, twice the I/O bandwidth of state-of-the-art network switch chips. All together, this forms Tesla's D1 chip.
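As a sanity check, the chip-level figures follow from the per-node numbers given earlier (a rough sketch; it assumes the quoted per-node rates simply add up across the chip):

```python
# Aggregate per-node compute across a D1 chip's 354 training nodes.
NODES_PER_CHIP = 354

chip_tflops_fp32 = 64 * NODES_PER_CHIP / 1000    # 64 GFLOPs FP32 per node
chip_tflops_bf16 = 1024 * NODES_PER_CHIP / 1000  # 1024 GFLOPs BF16 per node

print(chip_tflops_fp32)  # 22.656, matching the quoted ~22.6 TFLOPs
print(chip_tflops_bf16)  # 362.496, matching the quoted ~362 TFLOPs
```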
In contrast with general-purpose chips like the Nvidia A100, the D1 chip is built entirely for training machine learning models. Its unique design provides "GPU-level compute, CPU-level flexibility, and twice the network chip-level I/O bandwidth." The presentation showed a comparison (off-chip bandwidth vs TFLOPs of compute) with state-of-the-art machine learning chips, including Google's TPU, modern GPUs, and startup chips.
The chips can be connected seamlessly, without glue, scaling the computational capacity and communication in every direction while keeping minimal latency between chips. The envisioned compute plane comprises ~500,000 training nodes and 1,500 D1 chips. But how could they integrate the chips to create such a compute plane and connect it with the rest of the high-level components — host systems and interface processors?
Training tile — A magnificent piece of engineering
The answer is training tiles. 25 D1 chips are integrated using a fan-out wafer process so that they preserve the high bandwidth. Additionally, they put connectors on the edges to preserve the off-chip I/O bandwidth. The resulting component is what they call the training tile, which provides 9 PFLOPs at BF16 and 36 TB/s of off-chip I/O bandwidth. This perhaps makes the training tile the "biggest organic MCM (multi-chip module) in the chip industry."
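The tile's headline compute also follows from the chip-level figure (again assuming the per-chip rates add up linearly):

```python
# 25 D1 chips per training tile, each at ~362 TFLOPs BF16.
tile_pflops_bf16 = 362 * 25 / 1000
print(tile_pflops_bf16)  # 9.05, close to the quoted 9 PFLOPs
```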
They designed the training tile to meet the criteria of high bandwidth and low latency across the computing plane, but they soon realized they needed to find new solutions to enable its manufacturing. To feed power into the training tile, they created a custom voltage regulator module that would go directly onto the fan-out wafer. They also integrated the electrical, thermal, and mechanical pieces to create a fully integrated training tile. The power supply and cooling are orthogonal to the compute plane, allowing for high performance, high bandwidth, and low latencies.
The training tile goes against the trends in the industry of “cutting the wafer into pieces,” says Chanan Bos of CleanTechnica. “This is completely unprecedented.”
ExaPOD — Tesla’s new supercluster
To build the cluster, they just had to put tiles together. A 2×3 tile matrix forms a tray, and two trays together form a cabinet. The ExaPOD consists of 10 cabinets. But, keeping the need for high bandwidth in mind, they "broke the cabinets' walls" and connected the trays one after another, creating a "seamless training mat."
The ExaPOD provides 1.1 EFLOPs at BF16 (120 training tiles, 3,000 D1 chips, and over 1 million training nodes), which makes Dojo almost as powerful as the GPU cluster Tesla is now using to train its networks. Thanks to the highly distributed, modular design, it's possible to use any subset of Dojo — called DPUs, Dojo Processing Units — for training purposes.
The high-bandwidth, low-latency fabric allows Dojo to perform 4x better than existing AI supercomputers at the same cost, while keeping the carbon footprint five times smaller and delivering 1.3x more performance per watt.
Elon Musk said at the end of the presentation that Dojo could be operational next year. And as if that weren't enough, Tesla has already sketched a next-generation plan that would allegedly provide a 10x improvement over the first Dojo computer.
At the highest level, there are two key takeaways from the presentation on Dojo. First, building all the hardware in-house allows Tesla to achieve unmatched performance for training AI models and permits full vertical integration. Second, designing all the components to be highly modular helps keep the bandwidth very high and the latency very low, two requirements for achieving such performance improvements. Tesla is again promising big; let's see what they can deliver.
A fair comparison
The TOP500 project presents the most powerful non-distributed supercomputers in the world twice a year. This year's June issue gives the first spot to Japan's Fugaku, which achieves 442.01 PFLOPs. If we were to compare the ExaPOD's 1.1 EFLOPs with this figure, we'd surely conclude that Tesla is about to build not only the fastest supercomputer in the world, but one roughly twice as powerful as the current number one.
There are two reasons why Dojo won’t be crowned as the fastest supercomputer. First, the high-performance computers (HPC) that are considered for the TOP500 list have to be capable of performing many different tasks. Dojo’s specificity prevents it from qualifying for the status of HPC.
Second, the performance tests of HPCs are conducted in single- or double-precision formats, that is, FP32 or FP64. Dojo achieves 1.1 EFLOPs at BF16 (bfloat16, the 16-bit Brain Floating Point format; "Brain" refers to Google Brain), which uses half the bits of FP32. And Dojo doesn't support FP64, which the most demanding scientific computations require.
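To make the precision difference concrete: BF16 keeps FP32's sign bit and 8-bit exponent but cuts the mantissa from 23 bits to 7. A minimal sketch of that reduction (illustrative only; real hardware typically rounds rather than truncates):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Keep only the top 16 bits of the FP32 bit pattern (round-toward-zero)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_fp32(b: int) -> float:
    """Re-expand a BF16 bit pattern to FP32 by zero-filling the low mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# Same sign and exponent, far fewer significant digits.
print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265)))  # 3.140625
```

Because the exponent width is unchanged, BF16 covers the same numeric range as FP32, which is why it suits training workloads where dynamic range matters more than mantissa precision.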
However, for illustration purposes, we can calculate how many computations per second Dojo can do. Because Tesla disclosed the performance of a D1 chip at both BF16 and FP32, it’s possible to make the conversion to calculate the computing capability of Dojo at FP32. (This procedure isn’t exactly right because we can’t simply scale performance linearly from the chip to the cluster, but it serves us to make a rough comparison.)
The D1 chip gives 22.6 TFLOPs at FP32 and 362 TFLOPs at BF16, and the ExaPOD gives 1.1 EFLOPs at BF16. Doing the math: Dojo performance at FP32 = 1.1 EFLOPs (BF16) / 362 TFLOPs (BF16) · 22.6 TFLOPs (FP32) ≈ 68.67 PFLOPs at FP32. If we assume the calculation is sufficiently accurate, Dojo is slightly less powerful than the cluster Tesla currently uses, which provides ~90 PFLOPs.
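That back-of-the-envelope conversion, spelled out (it assumes the chip-level BF16/FP32 ratio scales linearly to the cluster, as cautioned above):

```python
# Convert the ExaPOD's BF16 figure to an approximate FP32 figure
# using the D1 chip's quoted rates at both precisions.
chip_tflops_bf16 = 362
chip_tflops_fp32 = 22.6
exapod_tflops_bf16 = 1.1e6  # 1.1 EFLOPs expressed in TFLOPs

chip_equivalents = exapod_tflops_bf16 / chip_tflops_bf16
exapod_pflops_fp32 = chip_equivalents * chip_tflops_fp32 / 1000

print(round(exapod_pflops_fp32, 2))  # ~68.67 PFLOPs at FP32
```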
In any case, Dojo is far more efficient in terms of cost and emissions and, for AI training, no computer will likely beat it for a long time.
A unique design
To create a system perfectly aligned with the needs of AI workloads, Tesla engineers had to break some rules and make a few innovations with respect to industry standards. Bos has a very thorough review here on this topic.
Tesla's D1 chip is a "system on a chip," or SoC: a chip that integrates cache memory, a processor, a graphics unit, and other components on a single die. Most chips nowadays are designed this way. However, there are a few important differences between the D1 chip and other SoCs, and between the training tile and other MCMs.
The first thing any expert in computing hardware would realize is that Tesla promises a level of performance in the training tile that usually can’t be defined a priori with total certainty. The reason is the way the chips are usually integrated into the training tile.
Chips aren't made by placing their components individually. Instead, the elements of the chip are patterned onto a slim circular piece of high-quality silicon, called a wafer. This wafer is then cut into pieces that become the individual processors (GPUs, SoCs, and so on).
In the process of breaking the wafer, some chips can become partially useless. That’s why it’s unusual that Tesla can promise a flawless performance from the training tile (which following the standard in the industry would be a piece of the broken wafer). How can they make sure the 25 D1 chips work perfectly fine in the training tile when sometimes the chips don’t work as intended?
There are two possibilities. Tesla engineers may have found a way to ensure a perfectly working 5×5 grid of D1 chips when the piece is extracted from the larger wafer. Another option is that the training tile is itself the whole wafer. In any case, there’s groundbreaking innovation because by doing this they can guarantee the performance of the ExaPOD from the design of the D1 chip.
The second big difference is that computers normally have RAM (random access memory) outside the chips, but Dojo doesn't. There are two types of RAM: SRAM (static RAM; cache memory, for instance, is SRAM) and DRAM (dynamic RAM). SRAM's main advantage is that it's faster to access and consumes less energy. DRAM, on the other hand, is denser and can host more data in the same space. Both are usually necessary, but Tesla has designed Dojo so that it doesn't need DRAM.
The training nodes have 1.25 MB of SRAM each. Bos argues that it's probably one of the faster kinds of SRAM, L2 cache, which has a response time of 3–4 ns (versus ~60 ns for DRAM). With 354 training nodes per D1 chip, that amounts to 442.5 MB of cache per chip, more than any other chip out there.
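The per-chip and per-tile SRAM totals follow directly from the per-node figure:

```python
# Total on-chip SRAM from the per-node figure quoted in the article.
SRAM_MB_PER_NODE = 1.25
NODES_PER_CHIP = 354
CHIPS_PER_TILE = 25

sram_per_chip_mb = SRAM_MB_PER_NODE * NODES_PER_CHIP
print(sram_per_chip_mb)                   # 442.5 MB per D1 chip
print(sram_per_chip_mb * CHIPS_PER_TILE)  # 11062.5 MB, ~11 GB per training tile
```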
So what we get here is a D1 chip that has enough SRAM to not need an external DRAM nor a shared cache either. “As strange as the design sounds, the missing components that you would usually expect to find in an SoC might have been unnecessary,” Bos says. “This is a very specific system fine-tuned to a very particular task whereas most processors have a wider array of components to be more flexible to fit all kinds of tasks.”