RCCLX: Revolutionizing GPU Communication on AMD Platforms

We are open-sourcing the initial version of RCCLX – an enhanced version of RCCL that we developed and tested on Meta’s internal workloads. RCCLX is fully integrated with Torchcomms and aims to empower researchers and developers to accelerate innovation, regardless of their chosen backend.

Communication patterns for AI models are constantly evolving, as are hardware capabilities, and we want to iterate quickly on collectives, transports, and novel features on AMD platforms. We previously developed and open-sourced CTran, a custom transport library, on the NVIDIA platform. With RCCLX, we have integrated CTran into AMD platforms, enabling AllToAllvDynamic, a GPU-resident collective. While not all CTran features are currently integrated into the open-source RCCLX library, we aim to have them available in the coming months.

In this post, we highlight two new features: Direct Data Access (DDA) and Low Precision Collectives. These features provide significant performance improvements on AMD platforms, and we are excited to share them with the community.

Direct Data Access (DDA) – Lightweight Intra-node Collectives

Large language model inference operates through two distinct computational stages, each with fundamentally different performance characteristics:

  • The prefill stage processes the input prompt, which can span thousands of tokens, to generate a key-value (KV) cache for each transformer layer of the model. This stage is compute-bound because the attention mechanism scales quadratically with sequence length, making it highly demanding on GPU computational resources.
  • The decoding stage then utilizes and incrementally updates the KV cache to generate tokens one by one. Unlike prefill, decoding is memory-bound: the time spent reading model weights and the KV cache, which occupy the majority of GPU memory, dominates the attention computation.
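The asymmetry between the two stages can be captured with a toy cost model. The function names and constants below are illustrative assumptions, not numbers from this post; the point is only that prefill compute grows quadratically with prompt length while decode memory traffic grows linearly:

```python
# Toy cost model contrasting prefill and decode (illustrative only;
# sizes and constants here are assumptions, not from the post).

def prefill_attention_flops(seq_len: int, d_model: int) -> int:
    # Attention scores (seq_len x seq_len) plus the weighted sum of
    # values: both scale with seq_len^2 * d_model, so prefill is
    # compute-bound on long prompts.
    return 2 * seq_len * seq_len * d_model

def decode_kv_bytes_read(seq_len: int, d_model: int, n_layers: int,
                         bytes_per_elem: int = 2) -> int:
    # Each generated token re-reads K and V for every cached position
    # in every layer, so decode memory traffic grows linearly with
    # context length -- the memory-bound regime.
    return 2 * seq_len * d_model * n_layers * bytes_per_elem

# Doubling the prompt quadruples prefill compute...
assert prefill_attention_flops(4096, 1024) == 4 * prefill_attention_flops(2048, 1024)
# ...but only doubles the per-token decode memory traffic.
assert decode_kv_bytes_read(4096, 1024, 32) == 2 * decode_kv_bytes_read(2048, 1024, 32)
```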

Tensor parallelism enables models to be distributed across multiple GPUs by sharding individual layers into smaller, independent blocks that execute on different devices. However, one important challenge is that the AllReduce communication operation can contribute up to 30% of end-to-end (E2E) latency. To address this bottleneck, Meta developed two DDA algorithms.
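To see why this AllReduce sits on the critical path, here is a minimal single-process sketch in pure Python (toy sizes, made-up data; an illustration of the pattern, not RCCLX code). Row-sharding a weight matrix leaves each rank with a partial output of the full shape, and the elementwise sum of those partials is exactly the AllReduce that DDA accelerates:

```python
# Toy single-process sketch of tensor parallelism (pure Python,
# made-up sizes; illustrates the pattern, not RCCLX code).

def matmul(a, b):
    """Plain dense matmul on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = [[1.0, 2.0, 3.0, 4.0]]                            # 1x4 activations
w = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]  # 4x2 weight

# Row-shard w across two "ranks"; each rank multiplies the matching
# slice of x, producing a partial output of the full output shape.
partials = [matmul([x[0][0:2]], w[0:2]),
            matmul([x[0][2:4]], w[2:4])]

# The AllReduce: summing the partial outputs elementwise recovers the
# unsharded result.
out = [[sum(p[0][j] for p in partials) for j in range(2)]]
assert out == matmul(x, w)
```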

  • The DDA flat algorithm improves small-message allreduce latency by allowing each rank to directly load memory from every other rank and perform the reduction locally, cutting latency from O(n) to O(1) at the cost of increasing the total data exchanged from O(n) to O(n²), where n is the number of ranks.
  • The DDA tree algorithm breaks the allreduce into two phases (reduce-scatter and all-gather) and uses direct data access in each phase, moving the same amount of data as the ring algorithm but completing in a constant number of steps, which benefits slightly larger message sizes.
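The two schemes above can be sketched in a single-process simulation. The function names and buffer layout are illustrative assumptions, not the actual RCCLX kernels, but the structure mirrors the description: flat does one direct-load-and-reduce step per rank, while tree splits the work into a reduce-scatter phase and an all-gather phase:

```python
# Single-process sketch of the two DDA allreduce schemes (names and
# structure are illustrative assumptions, not the RCCLX kernels).

def dda_flat(bufs):
    # Flat: every rank directly loads every peer's buffer and reduces
    # locally -- one step (O(1) latency), O(n^2) total data movement.
    n = len(bufs)
    return [[sum(bufs[r][i] for r in range(n))
             for i in range(len(bufs[0]))] for _ in range(n)]

def dda_tree(bufs):
    # Tree, phase 1 (reduce-scatter): rank k reduces chunk k by
    # directly loading that chunk from every peer.
    n = len(bufs)
    chunk = len(bufs[0]) // n
    reduced = [sum(bufs[r][k * chunk + j] for r in range(n))
               for k in range(n) for j in range(chunk)]
    # Phase 2 (all-gather): every rank directly loads each reduced
    # chunk -- same data volume as ring, constant number of steps.
    return [list(reduced) for _ in range(n)]

bufs = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two "ranks", 4 elements each
assert dda_flat(bufs) == dda_tree(bufs) == [[11, 22, 33, 44]] * 2
```

Both paths land every rank on the same fully reduced buffer; they differ only in how many steps and how much inter-GPU traffic it takes to get there.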
