The fourth CXL memory sharing spec adds speed and extends distance for multi-rack systems.
The Compute Express Link (CXL) specification defines how pools of memory can be connected across the PCIe bus. There have been three prior generations. CXL 1.0 enabled x86 servers to access PCIe 5.0-linked memory in external devices such as SmartNICs and DPUs. CXL 2.0 added memory pooling between servers and external devices, still over PCIe 5.0, while CXL 3.0 added switches and PCIe 6.0 so that more servers and devices could share memory. Now CXL 4.0 uses PCIe 7.0 to double speed and adds features that increase memory pool span and bandwidth. The rising need for multi-rack AI servers is a target area for this.
Derek Rohde, an Nvidia Principal Engineer, and the CXL Consortium President and Treasurer, said: “The release of the CXL 4.0 specification sets a new milestone for advancing coherent memory connectivity, doubling the bandwidth over the previous generation with powerful new features.”
Nvidia’s NVLink point-to-point technology links its GPUs together so that they can directly share an HBM memory space without needing a host x86 server and its PCIe bus. A GPU server’s high-bandwidth memory (HBM) can function in the CXL memory space as a Type 2 device, sharing its memory with a host (x86) processor, and GPUs can be linked over CXL, but at a slower-than-NVLink rate. NVLink 5.0 provides up to 1,800 GB/s of bandwidth per B200 GPU. PCIe 7.0 provides up to 1,024 GB/s per CPU.
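A quick back-of-envelope comparison of the two per-device figures quoted above (both taken from this article, not independently measured) shows the gap:

```python
# Illustrative arithmetic only: compares the per-device bandwidth figures
# quoted in the article for NVLink 5.0 and PCIe 7.0.

nvlink5_gbps = 1800  # GB/s per B200 GPU (NVLink 5.0, per the article)
pcie7_gbps = 1024    # GB/s per CPU (PCIe 7.0, per the article)

ratio = nvlink5_gbps / pcie7_gbps
print(f"NVLink 5.0 offers roughly {ratio:.2f}x the bandwidth of PCIe 7.0")
```

On those numbers, NVLink 5.0 still carries roughly 1.76 times the bandwidth of a PCIe 7.0 connection.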
CXL 4.0 doubles link bandwidth to 128 GT/s at the same latency as before, introduces a native x2 width concept plus bundled ports, and supports up to four retimers to extend link distance.
Native x2 link widths are present to increase fan-out. The link width specifies the number of parallel data-path lanes in a CXL connection, with CXL 1.0 to 3.0 supporting x4, x8, and x16 widths. The x1 and x2 widths were fallback widths for lane failure and error recovery, operating in a slower, degraded mode. In CXL 4.0, x2 is fully optimized for performance, like the x4 to x16 widths.
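The raw numbers behind the width options can be sketched as follows. This assumes the usual PCIe convention that GT/s counts one bit per lane per transfer, and it ignores flit and FEC overhead, so the figures are upper bounds rather than spec-guaranteed throughput:

```python
# Raw per-direction bandwidth for CXL 4.0 link widths at 128 GT/s.
# Assumption: one bit per lane per transfer (PCIe convention), no
# flit/FEC overhead deducted, so these are theoretical maxima.

GT_PER_S = 128  # transfers per second per lane, in billions

def link_bandwidth_gbytes(width: int) -> float:
    """Raw one-direction bandwidth in GB/s for a link `width` lanes wide."""
    return width * GT_PER_S / 8  # 8 bits per byte

for width in (2, 4, 8, 16):
    print(f"x{width}: {link_bandwidth_gbytes(width):.0f} GB/s per direction")
```

On that basis a native x2 link delivers around 32 GB/s per direction, which is why a fully performance-optimized x2 width is attractive for fanning out many devices from one host.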
Bundled ports aggregate multiple physical CXL device ports into a single logical entity to increase bandwidth and connectivity. A CXL 4.0 white paper explains the concept.
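Conceptually, a bundled port behaves like a single logical port whose capacity is the sum of its members. The sketch below illustrates the idea only; the class and field names are hypothetical, not taken from the CXL 4.0 specification:

```python
# Conceptual sketch of bundled ports: several physical ports presented
# as one logical entity. Names here are illustrative, not from the spec.

from dataclasses import dataclass

@dataclass
class PhysicalPort:
    lanes: int
    gbytes_per_s: float  # per-direction bandwidth of this port

@dataclass
class BundledPort:
    members: list

    @property
    def gbytes_per_s(self) -> float:
        # Traffic can be striped across member ports, so the logical
        # bandwidth is the aggregate of the physical ports.
        return sum(p.gbytes_per_s for p in self.members)

# Two native x2 ports bundled into one logical port.
bundle = BundledPort([PhysicalPort(2, 32.0), PhysicalPort(2, 32.0)])
print(bundle.gbytes_per_s)  # 64.0
```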
PCIe signal quality degrades over distance and as the data rate increases. Retimers are mixed-signal integrated circuits that take in a partially degraded signal and refresh it using a clock-and-data-recovery circuit. As we understand it, four retimers will enable the underlying PCIe link to be extended to support multi-rack configurations. We might hope to see CXL 4.0 multi-rack systems implemented in the late 2026 to 2027 period. Download a CXL 4.0 spec here.
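The reach benefit of retimers can be sketched with simple arithmetic: each retimer re-drives the signal, so a link with N retimers is effectively N+1 electrical segments. The per-segment reach figure below is an illustrative assumption, not a number from the specification:

```python
# Rough sketch of why retimers extend reach: a link with N retimers
# consists of N+1 electrical segments. The per-segment reach value is
# an assumed placeholder, not a figure from the CXL 4.0 spec.

PER_SEGMENT_REACH_M = 1.0  # assumed reach of one electrical segment, metres

def total_reach_m(retimers: int) -> float:
    """Approximate end-to-end reach of a link with `retimers` retimers."""
    return (retimers + 1) * PER_SEGMENT_REACH_M

for n in (0, 2, 4):
    print(f"{n} retimers -> up to {total_reach_m(n):.1f} m")
```

Whatever the real per-segment reach turns out to be, four retimers multiply it five-fold, which is what makes cable runs between racks plausible.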