Nvidia’s basic context memory extension infrastructure

To get to know the basic Nvidia KV cache extension infrastructure, the ICMSP (Inference Context Memory Storage Platform), better, we asked Nvidia some questions about the Vera Rubin Pod racks and have prepared this initial picture of the ICMSP scheme.

The top image is a screengrab from Jensen Huang’s CES 2026 presentation, at roughly the 1 hour, 20 minute mark, showing the Vera Rubin SuperPod. The two far-right racks (circled) are enlarged in the lower half of this graphic. The rightmost is a set of network switches for off-pod networking. The left-hand one, the BlueField-4 (BF4) rack, contains the ICMSP enclosures, installed beneath a pair of Spectrum-X Ethernet networking switches. An Nvidia technical blog says that the ICMSP holds latency‑sensitive, reusable inference context and prestages it to increase GPU utilization.

We should note that a Vera Rubin compute tray contains 2 x Vera CPUs, 4 x Rubin GPUs, 4 x ConnectX-9 Spectrum-X SuperNICs providing predictable, low‑latency, and high‑bandwidth RDMA connectivity, and a single BlueField-4 (BF4) DPU to handle storage and security. This DPU also contains ConnectX-9 technology.

Nvidia’s Itay Ozery, Senior Manager, Networking Products, tells us that the BF4 rack contains 16 storage enclosures beneath the Spectrum-X switches. Each of these enclosures includes 4 x BlueField-4s, making 64 BF4s in total. Behind each BlueField-4, Huang said in his pitch, is 150 terabytes of context memory. That totals 16 x 4 x 150 = 9,600 TB.

Ozery says there are 16 x NVL72 GPU racks in a Vera Rubin SuperPod, each holding 72 Rubin GPUs. That makes for a total of 1,152 Rubin GPUs. Nvidia tells us: “The inference context memory storage infrastructure can support up to 16TB for each GPU.”

In other words, this infrastructure can support 1,152 x 16 = 18,432 TB of context memory. Ozery says: “The storage infrastructure’s sole purpose is to serve inference context memory.” It doesn’t do anything else.
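
As a sanity check, here is that capacity arithmetic in a short Python sketch, using only the figures quoted above; the closing comment simply observes that the two totals differ, presumably because 16 TB is a per-GPU ceiling rather than a simultaneous guarantee for all 1,152 GPUs.

```python
# Capacity arithmetic for a Vera Rubin SuperPod's ICMSP tier,
# using the figures quoted in the article above.

enclosures_per_bf4_rack = 16      # storage enclosures under the Spectrum-X switches
bf4s_per_enclosure = 4            # BlueField-4 DPUs per enclosure
tb_behind_each_bf4 = 150          # NVMe flash capacity behind each BF4 (per Huang)

installed_tb = enclosures_per_bf4_rack * bf4s_per_enclosure * tb_behind_each_bf4
print(f"Installed ICMSP flash: {installed_tb:,} TB")                  # 9,600 TB

nvl72_racks_per_pod = 16
gpus_per_rack = 72
max_tb_per_gpu = 16               # Nvidia's stated per-GPU figure

gpu_count = nvl72_racks_per_pod * gpus_per_rack                       # 1,152 Rubin GPUs
aggregate_ceiling_tb = gpu_count * max_tb_per_gpu
print(f"Per-GPU ceiling in aggregate: {aggregate_ceiling_tb:,} TB")   # 18,432 TB

# Note: the aggregate per-GPU ceiling exceeds the installed flash, so 16 TB/GPU
# reads as an upper bound per GPU, not a guarantee that every GPU gets 16 TB at once.
```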

An individual ICMSP storage enclosure contains 4 BF4s with, Huang says, 150TB of NVMe SSD capacity behind each BF4. Who supplies the storage enclosures when a customer buys a Vera Rubin SuperPod? Ozery tells us: “The storage infrastructure for the Vera Rubin pod is designed, built, and delivered by our storage partners based on the Nvidia reference designs.”

Nvidia diagram showing KV cache tiers with ICMSP (G3.5) and external storage (G4). Nvidia says the ICMS “platform establishes a new G3.5 layer, an Ethernet-attached flash tier optimized specifically for KV cache. This tier acts as the agentic long‑term memory of the AI infrastructure pod that is large enough to hold shared, evolving context for many agents simultaneously, but also close enough for the context to be pre‑staged frequently back into GPU and host memory without stalling decode. … The G3.5 tier delivers massive aggregate bandwidth with better efficiency than classic shared storage.”

The ICMSP is a G3.5 tier, bridging the gap between the in-Pod rack G3 tier and the off-Pod G4 tier. Nvidia’s tech blog says: “Inference frameworks like Nvidia Dynamo use their KV block managers together with Nvidia Inference Transfer Library (NIXL) to orchestrate how inference context moves between memory and storage tiers, using ICMS as the context memory layer for KV cache. KV managers in these frameworks prestage KV blocks, bringing them from ICMS into G2 or G1 memory ahead of the decode phase.”
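
To make the prestaging idea concrete, here is a hypothetical sketch of a KV block manager pulling context from the ICMS (G3.5) tier into host memory (G2) ahead of decode, and promoting it into GPU memory (G1) when needed. None of the class or method names are real Dynamo or NIXL APIs; they are assumptions made purely for illustration.

```python
# Hypothetical sketch of KV-block prestaging, loosely following the flow Nvidia
# describes: a KV block manager copies context from the ICMS (G3.5) flash tier
# into host DRAM (G2) or GPU HBM (G1) before the decode phase needs it.
# None of these names are real Dynamo/NIXL APIs; they only illustrate the idea.

from concurrent.futures import ThreadPoolExecutor

class KVBlockManager:
    def __init__(self, icms_reader, g2_cache, g1_cache):
        self.icms_reader = icms_reader    # reads KV blocks from the flash tier
        self.g2_cache = g2_cache          # host-memory (G2) staging area
        self.g1_cache = g1_cache          # GPU HBM (G1) working set
        self.pool = ThreadPoolExecutor(max_workers=8)   # overlap I/O with compute

    def prestage(self, session_id, block_ids):
        """Asynchronously copy the KV blocks a session will need from ICMS
        into G2, so the decode loop can promote them to G1 without stalling."""
        def fetch(block_id):
            if block_id not in self.g2_cache:
                self.g2_cache[block_id] = self.icms_reader.read(session_id, block_id)
        return [self.pool.submit(fetch, b) for b in block_ids]

    def promote_to_gpu(self, block_id):
        """Move a prestaged block from host memory into HBM just before decode."""
        self.g1_cache[block_id] = self.g2_cache[block_id]
```

In a real deployment the promotion step would be a DMA into HBM driven by NIXL across the DPU rather than a Python dictionary copy; the point of the sketch is only that the flash reads overlap with ongoing decode.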

We’re told: “When combined with the Nvidia BlueField-4 processor running the KV I/O plane, the system efficiently terminates NVMe-oF and object/RDMA protocols.”

Nvidia Inference Context Memory Storage architecture within the Rubin platform, from inference pool to BlueField-4 ICMS target nodes. The architecture uses BlueField‑4 to accelerate KV I/O and control plane operations, across DPUs on the Rubin compute nodes and controllers in ICMS flash enclosures, reducing reliance on the host CPU and minimizing serialization and host memory copies. Additionally, Spectrum‑X Ethernet provides the AI‑optimized RDMA fabric that links ICMS flash enclosures and GPU nodes with predictable, low‑latency, high‑bandwidth connectivity.

The Nvidia blog states: “At the inference layer, NVIDIA Dynamo and NIXL manage prefill, decode, and KV cache while coordinating access to shared context. Under that, a topology-aware orchestration layer using Nvidia Grove places workloads across racks with awareness of KV locality so workloads can continue to reuse context even as they move between nodes.”

“At the compute node level, KV tiering spans GPU HBM, host memory, local SSDs, ICMS, and network storage, providing orchestrators with a continuum of capacity and latency targets for placing context. Tying it all together, Spectrum-X Ethernet links Rubin compute nodes with BlueField-4 ICMS target nodes, providing consistently low latency and efficient networking that integrates flash-backed context memory into the same AI-optimized fabric that serves training and inference.”
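
A minimal way to picture that continuum is as an ordered list of tiers, with a placement routine that drops a KV block into the fastest tier that still has room. The tier names below come from the quote; the capacities and the placement policy are illustrative assumptions, not Nvidia’s design.

```python
# A minimal, hypothetical model of the "continuum of capacity and latency targets"
# the blog describes: each KV tier is listed fastest-first, and a block is placed
# in the quickest tier that still has room. Capacities are placeholders.

from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity_blocks: int
    used_blocks: int = 0

    def has_room(self) -> bool:
        return self.used_blocks < self.capacity_blocks

# Fastest (smallest) to slowest (largest), per the article's tiering.
tiers = [
    Tier("GPU HBM (G1)", capacity_blocks=1_000),
    Tier("Host memory (G2)", capacity_blocks=10_000),
    Tier("Local SSD (G3)", capacity_blocks=100_000),
    Tier("ICMS flash (G3.5)", capacity_blocks=1_000_000),
    Tier("Network storage (G4)", capacity_blocks=10_000_000),
]

def place_block(block_id: str) -> str:
    """Place a KV block in the fastest tier with free capacity and return its name."""
    for tier in tiers:
        if tier.has_room():
            tier.used_blocks += 1
            return tier.name
    raise RuntimeError(f"No capacity anywhere for block {block_id}")

print(place_block("session-42/block-0"))   # "GPU HBM (G1)" while HBM has room
```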

Our understanding is that the ICMSP storage enclosures are JBOFs. The software controlling and managing them is concerned with providing KV cache ‘record’ (aka key:value pair) storage for an AI workload running on one or more GPUs in the Vera Rubin SuperPod. Such workloads use a GPU’s high-bandwidth memory (HBM) and a CPU’s DRAM in a two-tier scheme to hold context memory, and this data is accessed with load and store instructions, not storage semantics. We understand there will then need to be some sort of specialized FTL (Flash Translation Layer) software or firmware to translate the KV cache’s memory addressing into the NVMe SSDs’ storage-based addressing in the G3.5 KV cache tier provided by the ICMSP.
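
Here is a sketch of what such a translation layer might look like, under the assumption that KV records are identified by keys while the flash behind each BF4 is addressed in fixed-size blocks; the SSD interface and layout are invented for illustration and are not Nvidia’s design.

```python
# Hypothetical sketch of the translation the article anticipates: KV cache records
# are identified by keys (memory-style addressing in HBM/DRAM), while the NVMe SSDs
# behind each BlueField-4 are addressed as blocks. A small mapping layer bridges the two.

BLOCK_SIZE = 4096   # assume the flash is carved into fixed-size blocks

class KVToFlashMap:
    def __init__(self):
        self.next_lba = 0        # next free logical block address (simple append layout)
        self.index = {}          # kv_key -> (start_lba, n_blocks, payload_length)

    def write(self, kv_key: str, payload: bytes, ssd) -> None:
        """Persist a KV cache record and remember where it landed on flash."""
        n_blocks = -(-len(payload) // BLOCK_SIZE)        # ceiling division
        start = self.next_lba
        ssd.write_blocks(start, payload)                 # hypothetical SSD interface
        self.index[kv_key] = (start, n_blocks, len(payload))
        self.next_lba += n_blocks

    def read(self, kv_key: str, ssd) -> bytes:
        """Look up a record's flash location and read exactly its payload back."""
        start, n_blocks, length = self.index[kv_key]
        return ssd.read_blocks(start, n_blocks)[:length]
```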

Nvidia’s blog says: “the Nvidia DOCA framework introduces a KV communication and storage layer that treats context cache as a first class resource for KV management, sharing, and placement, leveraging the unique properties of KV blocks and inferencing patterns. DOCA interfaces inference frameworks, with BlueField-4 transferring the KV cache efficiently to and from the underlying flash media.”

A KV cache-specialized storage enclosure has to do one thing well: hold low-latency, high-bandwidth cache data. That doesn’t involve providing storage-based data services such as snapshots, replication, data reduction, and so on. Yet many storage suppliers are partnering with Nvidia in its ICMSP effort: Cloudian, DDN, Dell, HPE, Hitachi Vantara, IBM, Nutanix, Pure Storage, VAST Data, and WEKA logos were displayed, for example, during Huang’s ICMSP pitch in his presentation.

Nvidia notes: “By leveraging standard NVMe and NVMe-oF transports, including NVMe KV extensions, ICMS maintains interoperability with standard storage infrastructure while delivering the specialized performance required for KV cache.”
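
For readers unfamiliar with the NVMe KV extensions mentioned here: the Key Value command set lets a drive be addressed by key rather than by logical block address, which would allow the kind of mapping sketched above to move into the device itself. The wrapper below is a hypothetical illustration of those store/retrieve semantics, not a real driver binding.

```python
# With the NVMe Key Value command set, the drive itself accepts a key and a value,
# so the host-side key-to-LBA bookkeeping sketched earlier can live in the device.
# This class is a hypothetical stand-in, not an actual NVMe-KV driver interface.

class HypotheticalKVNamespace:
    """Stand-in for an NVMe-KV namespace exposing store/retrieve by key."""
    def __init__(self):
        self._store = {}          # the real thing would be flash media, not a dict

    def kv_store(self, key: bytes, value: bytes) -> None:
        # Mirrors key-addressed "store" semantics: the key locates the data directly.
        self._store[key] = value

    def kv_retrieve(self, key: bytes) -> bytes:
        # Mirrors key-addressed "retrieve" semantics.
        return self._store[key]

ns = HypotheticalKVNamespace()
ns.kv_store(b"session-42/layer-7/block-3", b"...kv tensor bytes...")
assert ns.kv_retrieve(b"session-42/layer-7/block-3")
```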

Their storage products will be connected to the ICMSP storage enclosures and provide data services for the data in them, such as a further tier of KV cache data storage accessed at slower speed across an off-Pod network link – the G4 tier in the diagram above. Nvidia’s tech blog says: “With a large portion of latency-sensitive, ephemeral KV cache now served from the G3.5 tier, durable G4 object and file storage can be reserved for what truly needs to persist over time. This includes inactive multiturn KV state, query history, logs, and other artifacts of multiturn inference that may be recalled in later sessions.”

Nvidia says: “The DOCA framework supports open interfaces for broader orchestration, providing flexibility to storage partners to expand their inference solutions to cover the G3.5 context tier.”

We note that VAST Data has ported its software to BlueField-3 processors and its Ceres data enclosure has a BF3 hardware controller. No doubt we will see a BF4 version of Ceres.

Will we see some or all of the other storage suppliers above port their storage SW to BF4? That’s an interesting question – and we have no answer.

Bootnote

  1. We think the SSDs used in the ICMSP enclosures will likely be PCIe Gen 5 for speed reasons.
  2. An informative blog about Nvidia’s ICMSP can be read here. It notes that Nvidia’s Dynamo software provides KV block management. This includes: “native support for evicting KV cache from GPU memory, offloading it to CPU memory or external storage, and retrieving it later.” The blogger adds: “A key part of this is the new asynchronous transport library called NIXL, which allows KV cache to move anywhere in the memory hierarchy—HBM, Grace or Vera CPU memory, or fully off-rack storage—without interrupting ongoing GPU computation.”
  3. The G3 layer in Nvidia’s KV cache tiering scheme is what Hammerspace calls tier zero. Hammerspace’s CMO, Molly Presley, tells us: “Inference Context Memory is exactly the kind of GPU-side data and metadata pipeline Hammerspace was built for. We are actively working with Nvidia on BlueField-4 and we have a program in plan to support it as part of our roadmap this year. Hammerspace already delivers the core requirement — getting the right data to the right GPU at the right time — and ICM is a natural extension of that for inference-time state, KV caches, and context reuse at scale.”