Why TCP/IP Stack is Highly Inefficient for High-Performance Computing Systems
Performance inefficiencies of TCP/IP in AI training clusters
In the last decade, the amount of data has grown very rapidly. A single computer is no longer up to the task for these large computations, so multiple computers are put together to process data and perform computations. But high compute power is just one piece of achieving low job completion times in AI training clusters. The network interconnect between these systems must also be high-performance. If not, the overall job completion time will always be bottlenecked by the network.
Over the last two decades, Ethernet has evolved from 100 Mbps to 400 Gbps. With this evolution in hardware and network cards, data can be transferred over the network at a very high rate. But to achieve this, the system must also be able to process packets fast enough. If not, we will never be able to use the full network bandwidth.
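To make "fast enough" concrete, here is a rough back-of-the-envelope sketch of how much time the system has per packet if the link is kept full. The 1500-byte packet size is an assumption chosen for illustration, and framing overhead is ignored; the function name is hypothetical.

```python
def per_packet_budget_ns(link_bps: float, packet_bytes: int) -> float:
    """Time available to handle one packet when the link is fully utilized."""
    packets_per_second = link_bps / (packet_bytes * 8)
    return 1e9 / packets_per_second


# Assumed packet size of 1500 bytes; real traffic mixes many sizes.
for label, bps in [("100 Mbps", 100e6), ("10 Gbps", 10e9), ("400 Gbps", 400e9)]:
    budget = per_packet_budget_ns(bps, 1500)
    print(f"{label}: about {budget:,.0f} ns per 1500-byte packet")
```

Under these assumptions the per-packet budget shrinks from roughly 120 microseconds at 100 Mbps to roughly 30 nanoseconds at 400 Gbps, which is why the host's packet processing path matters as much as the link speed.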
TCP Processing Overhead
The Transmission Control Protocol (TCP) is the most widely used transport layer protocol in today’s network stacks. TCP continues its dominance due to its reliability, adaptability, and robustness for a wide range of applications.
Once the data is transferred over the network, the transport layer has a lot of processing to do. This processing is required to achieve the reliability and robustness of TCP. It includes flow control, congestion control, checksum computation, and passing the data to the application.
The evolution of gigabit-speed networks in the last few years has challenged TCP in two main respects: performance and CPU requirements.
Let’s look at the CPU requirements for protocol processing. A common rule of thumb is that transport protocol processing needs about 1 Hz of CPU for every 1 bit per second of throughput. As the bandwidth of the network card increases and more data is transferred over the network, the system spends more and more CPU cycles on protocol processing. At high enough rates, the majority of the CPU is consumed by network protocol processing rather than by application computation.
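As a rough illustration, the sketch below applies the common rule of thumb of about 1 Hz of CPU per 1 bit per second of TCP throughput. The 3 GHz core clock is an assumption, the function names are hypothetical, and the rule itself is only an approximation (hardware offloads change the picture considerably).

```python
def cpu_hz_needed(link_bps: float, hz_per_bps: float = 1.0) -> float:
    """CPU clock rate consumed by protocol processing under the rule of thumb."""
    return link_bps * hz_per_bps


def cores_needed(link_bps: float, core_hz: float = 3e9, hz_per_bps: float = 1.0) -> float:
    """Equivalent number of cores, assuming each core runs at core_hz."""
    return cpu_hz_needed(link_bps, hz_per_bps) / core_hz


for label, bps in [("1 Gbps", 1e9), ("10 Gbps", 10e9), ("400 Gbps", 400e9)]:
    print(f"{label}: about {cores_needed(bps):.1f} cores at 3 GHz")
```

Even as an order-of-magnitude estimate, this shows why protocol processing alone can crowd out application work once link speeds reach tens or hundreds of gigabits per second.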
Now let’s look at the performance issues related to protocol processing. One source of overhead that affects performance is the operating system.
Protocol processing has to do many things in the operating system, like taking interrupts, allocating and freeing packet buffers, restarting I/O devices, waking up processes, and managing timers. These contribute some of the processing overhead, but not the major part.
The main bottleneck in the protocol processing affecting performance is the multiple memory copies involved in the process.
In the IP protocol suite, when data is transferred, the following steps happen on the receiving end:
- The Network Interface Controller receives the data [1 read operation] and interrupts the kernel
- The kernel identifies which application the data belongs to and wakes up that application
- The application copies the data from the socket buffer to the application buffer [1 read and 1 write operation]
- The checksum is computed [1 read operation]
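The copy and checksum steps can be sketched in user space as follows. This is a minimal illustration, not how a real TCP stack is organized: it uses a local `socket.socketpair()` rather than a TCP connection, and it computes an RFC 1071 style checksum in the application purely to show the extra read pass over the data (in a real stack the kernel or the NIC does this). The helper names are hypothetical.

```python
import socket


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum (RFC 1071 style); reads every byte once."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def receive_into_app_buffer(sock: socket.socket, app_buffer: bytearray) -> int:
    """Copy from the kernel socket buffer into the application's own buffer."""
    return sock.recv_into(app_buffer)


sender, receiver = socket.socketpair()
payload = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
sender.sendall(payload)                              # data lands in a kernel buffer
app_buffer = bytearray(64)
n = receive_into_app_buffer(receiver, app_buffer)    # one read plus one write
checksum = internet_checksum(bytes(app_buffer[:n]))  # another read pass
print(n, hex(checksum))
sender.close()
receiver.close()
```

Every byte of the payload is touched several times before the application can use it, which is the cost that the memory access counts in the list are tallying.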
As you can see above, receiving a packet from the network requires four memory access operations. The 1989 analysis by Clark et al. (listed in the references) makes the cost concrete: a 32-bit memory with a cycle time of 250 ns, typical for dynamic RAM at that time, can move 128 Mb/s, and four accesses per word reduce the usable limit to 32 Mb/s. So memory access has to be much faster than the link for the network bandwidth to be used efficiently. These memory copies are the most significant performance inefficiency in TCP processing.
Over the years, other technologies were developed to avoid the processing overheads and multiple copies in the IP network stack: InfiniBand and Remote Direct Memory Access (RDMA).
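The 32 Mb/s figure follows from simple arithmetic, sketched below. The function name is hypothetical; the inputs are the ones quoted in the text (32-bit words, 250 ns cycle time, four memory accesses per received word).

```python
def memory_limited_throughput_bps(word_bits: int, cycle_time_s: float,
                                  accesses_per_word: int) -> float:
    """Upper bound on network throughput when every word costs several memory accesses."""
    raw_memory_bps = word_bits / cycle_time_s   # one access per cycle
    return raw_memory_bps / accesses_per_word   # each word is touched several times


raw = memory_limited_throughput_bps(32, 250e-9, 1)
limit = memory_limited_throughput_bps(32, 250e-9, 4)
print(f"raw memory bandwidth: {raw / 1e6:.0f} Mb/s")
print(f"limit with 4 accesses per word: {limit / 1e6:.0f} Mb/s")
```

The same formula makes the benefit of removing copies visible: halving the number of accesses per word doubles the achievable throughput for the same memory.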
InfiniBand and RDMA
InfiniBand is a networking communications standard used in high-performance computing systems for high throughput and low latency. It is used as the data interconnect between computers, in the role that Ethernet usually plays. At the time of writing, Ethernet provides connectivity up to 400 Gbps, whereas InfiniBand provides up to 200 Gbps.
InfiniBand supports RDMA, which is a zero-copy technology. It bypasses the kernel during communication and does not involve the CPU in moving the data. This removes a lot of CPU overhead and performs much better than TCP.
RDMA provides access to the memory of another system without involving either system’s operating system. The use of RDMA usually requires specialized networking hardware that implements InfiniBand. RDMA technology lets the Network Interface Controller know two things when a packet arrives: which application the packet belongs to, and where in that application’s buffer memory the packet should be placed. With this information, the data is written directly into the application buffer without going through the network stack or involving the operating system. This requires the application to use the InfiniBand verbs API to perform RDMA operations, so the application must support this API.
In recent years, RoCE (RDMA over Converged Ethernet) was developed. It provides RDMA features over a standard Ethernet network, so no InfiniBand fabric is required, although the Network Interface Controller still has to support RDMA.
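The idea of placing data directly into a registered application buffer can be illustrated with a toy model. This is not the verbs API and does not touch real hardware: `ToyNic`, `register_memory`, and `rdma_write` are hypothetical names, and the key simply stands in for the identifier a real NIC would use to locate a registered memory region.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNic:
    """Toy stand-in for an RDMA-capable NIC (illustrative only, not a real API)."""
    regions: dict = field(default_factory=dict)
    next_key: int = 1

    def register_memory(self, buffer: bytearray) -> int:
        """The application registers a buffer up front and gets a key for it."""
        key = self.next_key
        self.next_key += 1
        self.regions[key] = memoryview(buffer)  # a view, not a copy
        return key

    def rdma_write(self, key: int, offset: int, payload: bytes) -> None:
        """An incoming write names the region (key) and the position (offset)."""
        region = self.regions[key]  # which application buffer
        if offset < 0 or offset + len(payload) > len(region):
            raise ValueError("write falls outside the registered region")
        region[offset:offset + len(payload)] = payload  # placed directly


app_buffer = bytearray(16)
nic = ToyNic()
key = nic.register_memory(app_buffer)
nic.rdma_write(key, 4, b"data")
print(app_buffer)
```

In this model the payload appears in `app_buffer` without an intermediate socket buffer and without the application calling a receive function, which mirrors the zero-copy, kernel-bypass behavior described above.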
TCP has a high processing overhead due to operating system overhead and multiple copies of data in memory. This makes it highly inefficient for high-performance, low-latency networks.
RDMA is widely used in high-speed, low-latency networks like AI training clusters. But RDMA has its drawbacks and is not suitable for all kinds of workloads: it is less flexible than TCP and more complex.
Not all workloads need high performance and low latency, and not all of them have the heavy data transfers that incur high processing overhead. Choosing between TCP and RDMA depends on the workload and the performance requirements of the system.
References
- Yong Wan, Dan Feng, Fang Wang, Liang Ming, Yulai Xie, An In-Depth Analysis of TCP and RDMA Performance on Modern Server Platform (2010), https://ieeexplore.ieee.org/document/6310890
- David D. Clark, Van Jacobson, John Romkey, Howard Salwen, An Analysis of TCP Processing Overhead (1989), https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=29545
- Renato John Recio, Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth analysis of the Memory Traffic Bottleneck (2003), http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.3915&rep=rep1&type=pdf