What are the specific requirements for network bandwidth and latency in AI training?

Release time: 2026-06-17

AI training (especially large-scale distributed training) has extremely strict requirements for network bandwidth and latency. According to existing technical specifications and industry practices, the specific requirements are as follows:

1. Extreme requirements for network bandwidth
AI training involves gradient synchronization among thousands of GPUs, a typical large-scale, high-throughput data transmission scenario. Bandwidth demand grows rapidly as models and clusters scale:
Single node/single GPU bandwidth requirement: In typical training workloads, each GPU requires 400 Gbps to 800 Gbps of dedicated bandwidth for communication between nodes [2]. For example, a system with 8 GPUs needs 3.2 Tbps of total network bandwidth to give each GPU 400 Gbps of communication bandwidth [2].
Cluster total bandwidth scale: For AI clusters with thousands to tens of thousands of GPUs, the total bandwidth requirement reaches the Pbps (petabits per second) level. For example, in a cluster with 10,000 GPUs, total GPU-to-GPU communication bandwidth is approximately 1 Pbps at 100 Gbps per GPU, and about 4 Pbps at 400 Gbps per GPU.
Sustained high utilization: During training, network bandwidth utilization needs to stay at 80% to 95% to support large, predictable data transfers such as All-reduce operations [2].
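
The bandwidth figures above are simple multiplications, and a short sketch makes them easy to check. The function names below are illustrative, not taken from any library, and the 100 Gbps per-GPU case is an assumption added only to show which per-GPU rate yields the quoted 1 Pbps for 10,000 GPUs.

```python
# Back-of-the-envelope bandwidth arithmetic for the figures quoted above.
# All helper names are hypothetical; units are decimal (1 Tbps = 1,000 Gbps).

GBPS_PER_TBPS = 1_000
GBPS_PER_PBPS = 1_000_000


def node_bandwidth_tbps(gpus_per_node: int, gbps_per_gpu: float) -> float:
    """Total network bandwidth one node needs, in Tbps."""
    return gpus_per_node * gbps_per_gpu / GBPS_PER_TBPS


def cluster_bandwidth_pbps(num_gpus: int, gbps_per_gpu: float) -> float:
    """Aggregate GPU-to-GPU bandwidth of a cluster, in Pbps."""
    return num_gpus * gbps_per_gpu / GBPS_PER_PBPS


def used_bandwidth_gbps(gbps_per_gpu: float, utilization: float) -> float:
    """Bandwidth actually carrying traffic at a given utilization (0.0 to 1.0)."""
    return gbps_per_gpu * utilization


if __name__ == "__main__":
    # 8 GPUs at 400 Gbps each: the 3.2 Tbps node figure from the text.
    print(node_bandwidth_tbps(8, 400))
    # 10,000 GPUs: 1 Pbps corresponds to 100 Gbps per GPU (assumed rate),
    # while 400 Gbps per GPU gives 4 Pbps.
    print(cluster_bandwidth_pbps(10_000, 100))
    print(cluster_bandwidth_pbps(10_000, 400))
    # 80% utilization of a 400 Gbps link.
    print(used_bandwidth_gbps(400, 0.80))
```

Note that the quoted 1 Pbps for 10,000 GPUs matches a 100 Gbps per-GPU rate; at the 400 Gbps to 800 Gbps per GPU given above, the same cluster would need 4 Pbps to 8 Pbps.
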
2. Strict requirements for network latency
AI training has extremely low tolerance for latency and is extremely sensitive to network jitter, which directly affects the convergence speed and stability of training:
Microsecond-level response requirement: Scale-up networks (tightly coupled interconnects between nodes) typically require latency below 1 microsecond (μs) to achieve fine-grained integration of computation and communication [4]. High-concurrency computing scenarios likewise need microsecond-level network latency [5].
Specific latency thresholds: During the training phase, the median (P50) latency of gradient synchronization between GPUs typically needs to stay within 200-500 microseconds (μs), and a tail latency (P99) of 1-2 milliseconds (ms) is acceptable [2]. On scale-up links the budget is far tighter: latency that is too high (for example, above 1.5 μs) creates "bubbles" (idle gaps) in GPU computation and significantly reduces compute utilization [4].
Zero packet loss and low jitter: Collective communication follows the weakest-link ("bucket") principle: a packet loss or retransmission between any single pair of GPUs delays the entire collective operation and leaves a large number of GPUs idle and waiting [1]. AI training therefore requires a long-term packet loss rate below 0.0001% and extremely low jitter, because jitter directly affects model convergence and training stability [2][5].
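
The weakest-link effect described above can be illustrated with a small probability sketch. It assumes, purely for illustration, that slow events (a P99-tail delay or a lost packet) occur independently across participants and that latency takes only two values, "fast" or "slow". Real networks show correlated congestion, so treat the numbers as rough intuition rather than a prediction. The function names are hypothetical.

```python
# Sketch of why tail latency and rare packet loss dominate at scale.
# Assumption: events are independent across participants or packets.


def prob_at_least_one(n: int, p: float) -> float:
    """Probability that at least one of n independent trials hits an
    event that has per-trial probability p."""
    return 1.0 - (1.0 - p) ** n


def expected_collective_latency_us(
    num_gpus: int, fast_us: float, slow_us: float, p_slow: float
) -> float:
    """Expected completion time of a collective that waits for its slowest
    participant, under a two-point (fast or slow) latency model."""
    p_any_slow = prob_at_least_one(num_gpus, p_slow)
    return (1.0 - p_any_slow) * fast_us + p_any_slow * slow_us


if __name__ == "__main__":
    # Each GPU exceeds its P99 latency 1% of the time.
    print(prob_at_least_one(8, 0.01))      # roughly 0.08 for 8 GPUs
    print(prob_at_least_one(1000, 0.01))   # above 0.9999 for 1,000 GPUs
    # Fast = 500 us (upper P50 bound), slow = 2000 us (upper P99 bound).
    print(expected_collective_latency_us(1000, 500.0, 2000.0, 0.01))
    # A 0.0001% loss rate (1e-6 per packet) across one million packets.
    print(prob_at_least_one(1_000_000, 1e-6))
```

Under these assumptions, a 1,000-GPU collective almost always waits on at least one participant in its P99 tail, so the collective's typical latency sits near the tail figure rather than the median. Likewise, even a 0.0001% loss rate leaves a better-than-even chance of at least one loss in a million-packet exchange, which is why the requirement is stated as near-zero loss.
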
In summary, AI training requires ultra-high bandwidth, from hundreds of Gbps per GPU up to Pbps across the cluster, to move massive volumes of data. It also requires microsecond-level ultra-low latency and a lossless network with almost "zero packet loss", so that thousands of computing nodes can work together as efficiently as a single supercomputer.