Xe-LP Microarchitecture Overview
Historically, Intel graphics chips are divided into generations (GenX), and each generation is subdivided into tiers of increasing performance, labeled GTx. The 11th Gen Intel® Core™ Processor has a GT2 chip, which is the most performant Intel GPU with the Xe-LP microarchitecture at the time of writing.
The high-end Gen12 GT2 GPU has 96 execution units, or EUs (compared to 64 EUs in Gen11 GT2 or 24 EUs in Gen9 GT2 GPUs). Each EU has SIMD8-wide floating-point and integer arithmetic logic units (ALUs). In addition, the Xe-LP GPUs have a larger L3 cache (3.8 MB) and separate shared local memory (768 KB) that is no longer part of the L3 cache.
The improvements do not stop at the number of EUs and the larger caches. The Xe-LP GPUs can operate at higher frequencies at the same voltage, which improves the performance and power efficiency of all workloads.
Many GPU kernels suffer from low occupancy, meaning the GPU is underutilized. To address this issue, the Xe-LP GPU allows two execution contexts to run in parallel, which can improve performance in such cases.
The most important feature for neural network inference is a new instruction added in Xe-LP called DP4A. This instruction is similar in spirit to the Vector Neural Network Instructions (VNNI) available on Intel CPUs and enables 64 operations per EU per clock at INT8 precision.
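To make the semantics concrete, the sketch below models a DP4A-style operation in Python: a four-element INT8 dot product accumulated into a 32-bit value. This is an illustrative reference model, not Intel's ISA definition; the function name and signature are chosen here for clarity.

```python
# Reference model of a DP4A-style instruction: a four-element INT8
# dot product accumulated into an INT32 value. Illustrative only,
# not the actual hardware ISA specification.

def dp4a(acc: int, a: list, b: list) -> int:
    """Return acc + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]."""
    assert len(a) == len(b) == 4
    # Inputs are expected to fit in signed INT8 range.
    assert all(-128 <= x <= 127 for x in a + b)
    return acc + sum(x * y for x, y in zip(a, b))

# One such operation performs 4 multiplies and 4 adds (including the
# accumulate), so a SIMD8-wide EU issuing it every clock reaches the
# 64 INT8 ops/EU/clock figure quoted in the text.
print(dp4a(10, [1, 2, 3, 4], [5, 6, 7, 8]))  # 10 + 5 + 12 + 21 + 32 = 80
```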
The compute throughput of the DP4A instruction is twice that of the FP16 multiply-add (MAD) instruction, which yields significant performance improvements for inference. This makes Intel GPUs well suited for running networks quantized to 8 bits, with a significant speedup.
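The peak-throughput implications of these numbers can be sketched with back-of-the-envelope arithmetic. The EU count and the 64 INT8 ops/EU/clock figure come from the text; the clock frequency below is an assumed example value, not a published specification.

```python
# Rough peak INT8 throughput estimate for a 96-EU Xe-LP GT2 GPU.
eus = 96
int8_ops_per_eu_per_clock = 64  # DP4A figure from the text
clock_ghz = 1.3                 # assumed example frequency, not a spec

peak_int8_tops = eus * int8_ops_per_eu_per_clock * clock_ghz / 1000.0
print(f"INT8 peak: {peak_int8_tops:.2f} TOPS")   # ~7.99 TOPS at 1.3 GHz

# Per the text, FP16 MAD throughput is half the INT8 DP4A throughput.
peak_fp16_tflops = peak_int8_tops / 2
print(f"FP16 peak: {peak_fp16_tflops:.2f} TFLOPS")
```

Quantizing a network to INT8 therefore roughly doubles the available compute relative to FP16 on this hardware, before accounting for memory-bandwidth savings.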