RDMA Explained: How Remote Direct Memory Access Works

Remote Direct Memory Access (RDMA) is a key technology in High-Performance Computing (HPC), enabling ultra-fast and efficient data transfer between compute nodes. With RDMA over Converged Ethernet v2 (RoCEv2), organizations can achieve low-latency, lossless communication across data centers and AI workloads.

This article explains how RDMA works step-by-step — from memory registration to RDMA write operations — with diagrams to help you understand the process.

Figure 1: Fine-Grained DRAM High-Level Architecture


What Is RDMA and Why Does It Matter?
#

RDMA allows one system to read from and write to the memory of a remote system without involving the remote CPU or OS in the data path, which significantly reduces latency and improves bandwidth efficiency. This is critical for:

  • AI and machine learning training (fast dataset transfer)
  • High-performance computing clusters (efficient inter-node communication)
  • Data-intensive workloads like real-time analytics, NVMe-over-Fabrics, and distributed storage

RoCEv2 runs over UDP, which provides no loss recovery of its own, so packets dropped under congestion are expensive to recover. To keep delivery lossless, modern networks rely on:

  • Priority Flow Control (PFC) – pauses upstream transmission before buffers overflow
  • Explicit Congestion Notification (ECN) – marks packets so senders slow down before drops occur

RDMA Process Overview
#

When a client compute node (CCN) writes data to a server compute node (SCN), the process involves four main steps:

  1. Memory Allocation and Registration
  2. Queue Pair Creation
  3. Connection Initialization
  4. RDMA Write Operation

Each step is explained in detail below.


Step 1: Memory Allocation and Registration
#

  • Allocate a Protection Domain (PD), similar to a tenant or VRF in networking.
  • Register memory blocks, defining size and access permissions.
  • Receive keys:
    • L_Key (Local Key) for local access
    • R_Key (Remote Key) for remote write access

In our example:

  • CCN memory → local read access
  • SCN memory → remote write access
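
As a concrete illustration, here is a minimal sketch of this step using the libibverbs API, the standard userspace RDMA verbs library. The helper name register_buffer, the opened device context ctx, and the buffer size are assumptions for the example:

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

/* Sketch: allocate a Protection Domain and register a memory region
 * on an already-opened device context. Names and sizes are
 * illustrative; error cleanup is omitted for brevity. */
struct ibv_mr *register_buffer(struct ibv_context *ctx, size_t len)
{
    struct ibv_pd *pd = ibv_alloc_pd(ctx);   /* Protection Domain */
    if (!pd)
        return NULL;

    void *buf = malloc(len);                 /* memory block to register */
    if (!buf)
        return NULL;

    /* Access flags grant local write plus remote write (as on the SCN). */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    /* mr->lkey is the L_Key, mr->rkey the R_Key described above;
     * the PD remains reachable via mr->pd. */
    return mr;
}
```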

Figure 2: Memory Allocation and Registration


Step 2: Queue Pair (QP) Creation
#

  • A Queue Pair (QP) = Send Queue + Receive Queue.
  • A Completion Queue (CQ) reports operation status.
  • Each QP is assigned a service type (Reliable Connection or Unreliable Datagram).

For reliable data transfer, we use RC (Reliable Connection).

  • Bind QP to PD and memory region.
  • Assign a Partition Key (P_Key), similar to a VXLAN VNI.

Example:

  • CCN QP ID: 0x12345678
  • Associated P_Key: 0x8012
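
A minimal libibverbs sketch of CQ and QP creation follows; the helper name create_rc_qp and the queue depths are illustrative assumptions:

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Sketch: create a completion queue and a reliably connected QP
 * bound to the given PD. Queue depths are illustrative. */
struct ibv_qp *create_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd)
{
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;                 /* completions for the Send Queue */
    attr.recv_cq = cq;                 /* completions for the Receive Queue */
    attr.qp_type = IBV_QPT_RC;         /* Reliable Connection service type */
    attr.cap.max_send_wr  = 16;
    attr.cap.max_recv_wr  = 16;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;

    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    /* qp->qp_num is the QP ID (e.g. 0x12345678 in the example above). */
    return qp;
}
```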

Figure 3: Queue Pair Creation


Step 3: RDMA Connection Initialization
#

Connection setup involves REQ → REP → RTU messages:

  • REQ (Request): CCN sends its Local ID, QP number, P_Key, and starting PSN.
  • REP (Reply): SCN responds with its own IDs, QP info, and PSN.
  • RTU (Ready to Use): CCN confirms the connection is ready.

At the end of the exchange, the QP state transitions from INIT → Ready to Receive (RTR) → Ready to Send (RTS).
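
In verbs terms, these transitions are driven with ibv_modify_qp. A compressed sketch, assuming the peer's QP number, PSN, and path information were learned from the REQ/REP exchange (the connect_qp helper and the attribute values are illustrative, and error handling is omitted):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: drive the QP through INIT -> RTR -> RTS. */
void connect_qp(struct ibv_qp *qp, uint32_t remote_qpn, uint32_t remote_psn,
                struct ibv_ah_attr *path)
{
    struct ibv_qp_attr a;

    memset(&a, 0, sizeof(a));
    a.qp_state = IBV_QPS_INIT;
    a.pkey_index = 0;                    /* selects the P_Key (e.g. 0x8012) */
    a.port_num = 1;
    a.qp_access_flags = IBV_ACCESS_REMOTE_WRITE;
    ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                          IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);

    memset(&a, 0, sizeof(a));
    a.qp_state = IBV_QPS_RTR;            /* Ready to Receive */
    a.path_mtu = IBV_MTU_1024;
    a.dest_qp_num = remote_qpn;          /* peer QP number from REQ/REP */
    a.rq_psn = remote_psn;               /* peer's starting PSN */
    a.max_dest_rd_atomic = 1;
    a.min_rnr_timer = 12;
    a.ah_attr = *path;                   /* address/route to the peer */
    ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                          IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                          IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);

    memset(&a, 0, sizeof(a));
    a.qp_state = IBV_QPS_RTS;            /* Ready to Send */
    a.sq_psn = 0;                        /* our starting PSN */
    a.timeout = 14;
    a.retry_cnt = 7;
    a.rnr_retry = 7;
    a.max_rd_atomic = 1;
    ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_SQ_PSN | IBV_QP_TIMEOUT |
                          IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                          IBV_QP_MAX_QP_RD_ATOMIC);
}
```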

Figure 4: RDMA Connection Initialization


Step 4: RDMA Write Operation
#

Once connected, the CCN application issues a Work Request (WR) containing:

  • OpCode: RDMA Write
  • Local buffer address + L_Key
  • Remote buffer address + R_Key
  • Payload length
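
This work request maps directly onto ibv_post_send in libibverbs. A sketch, assuming the remote address and R_Key were exchanged out of band (the post_rdma_write helper is an illustrative name):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: post an RDMA Write work request. The local MR and the
 * remote address/R_Key (learned out of band) are assumed inputs. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)mr->addr, /* local buffer address */
        .length = (uint32_t)len,                 /* payload length */
        .lkey   = mr->lkey,                      /* L_Key for local access */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE;           /* OpCode: RDMA Write */
    wr.send_flags = IBV_SEND_SIGNALED;           /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* remote buffer address */
    wr.wr.rdma.rkey        = rkey;               /* R_Key for remote write */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```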

The NIC then builds the required headers:

  • InfiniBand Base Transport Header (IB BTH) – carries the P_Key and destination QP number
  • RDMA Extended Transport Header (RETH) – carries the remote virtual address, R_Key, and data length
  • UDP header (destination port 4791) – signals that an IB BTH follows

Data is encapsulated and sent over Ethernet/IP/UDP/IB BTH/RETH.
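
For orientation, here is a rough C sketch of the two transport headers that follow the UDP header in a RoCEv2 packet; the struct and field names are illustrative, while the field widths follow the InfiniBand specification:

```c
#include <stdint.h>

/* Sketch of the on-wire RoCEv2 transport headers that follow the UDP
 * header (dst port 4791). 24-bit fields are shown as 3-byte arrays;
 * all fields are in network byte order. */
#pragma pack(push, 1)
struct ib_bth {                 /* Base Transport Header, 12 bytes */
    uint8_t  opcode;            /* e.g. RC RDMA Write */
    uint8_t  se_m_pad_tver;     /* SE | M | PadCnt | TVer bit fields */
    uint16_t pkey;              /* Partition Key (e.g. 0x8012) */
    uint8_t  reserved;
    uint8_t  dest_qp[3];        /* destination QP number, 24 bits */
    uint8_t  ack_req;           /* AckReq bit + reserved bits */
    uint8_t  psn[3];            /* Packet Sequence Number, 24 bits */
};

struct ib_reth {                /* RDMA Extended Transport Header, 16 bytes */
    uint64_t vaddr;             /* remote virtual address */
    uint32_t rkey;              /* R_Key authorizing the write */
    uint32_t dma_len;           /* total payload length */
};
#pragma pack(pop)
```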

Figure 5: Generating and Posting an RDMA Write Operation

On the SCN side:

  • Validate the P_Key and R_Key
  • Translate the virtual address to a physical address
  • Write the payload directly into the registered memory region (a plain RDMA Write consumes no receive queue entry)
  • Acknowledge the write, generating a completion on the CCN's completion queue
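
On the CCN side, the application typically learns that the write finished by polling its completion queue. A minimal sketch, with wait_for_completion as an illustrative helper name:

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Sketch: busy-poll the requester's CQ until the signaled RDMA Write
 * completes, then check its status. */
int wait_for_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    do {
        n = ibv_poll_cq(cq, 1, &wc);   /* non-blocking poll, 1 entry */
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "RDMA Write failed: %s\n",
                n < 0 ? "poll error" : ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;
}
```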

Figure 6: Receiving and Processing an RDMA Write Operation


Key Benefits of RDMA
#

  • 🚀 Ultra-low latency data transfer
  • ⚡ High bandwidth efficiency for HPC and AI
  • 🔒 Bypasses CPU/OS overhead for direct memory access
  • 📡 Lossless transport with PFC + ECN
  • 🔄 Scalable design for large clusters and data centers

Conclusion
#

RDMA technology, especially with RoCEv2 over IP Fabrics, is essential for modern data-intensive workloads. By offloading memory operations from CPUs and enabling direct memory access across compute nodes, RDMA improves performance in AI training, big data analytics, distributed storage, and cloud-scale HPC systems.

Organizations adopting RDMA can expect:

  • Faster application performance
  • Lower latency in AI/ML pipelines
  • Better efficiency in multi-node HPC environments

🔗 Original article: Detailed Explanation of the RDMA Working Process
