Data Management

SCADA


SCADA – Scaled Accelerated Data Access – an Nvidia architecture for accelerating AI inferencing I/O to its GPUs. In the Nvidia Blackwell GPU environment this is a client-server runtime that runs on GPUs, functioning as a multilevel cache between the PCIe bus, CPU, and storage on one side and the 100,000+ threads driving random I/Os in a GPU kernel on the other:

  • It coalesces I/O requests within the GPU and maintains a read-through cache, converting random I/Os into either local cache hits within the GPU or batches of I/Os that are packed together before being passed over PCIe to either local NVMe or a remote SCADA server (see the sketch after this list).
  • It takes full ownership of NVMe block devices and implements an NVMe driver inside the GPU. This keeps random I/Os from having to be processed on the host CPU.
  • It enables peer-to-peer PCIe in a way analogous to GPUDirect. This avoids sending I/Os all the way to host memory, and keeps traffic between GPUs and storage local to the PCIe switch they share.
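
To make the coalescing idea concrete, here is a minimal CPU-side Python sketch, assuming a block-granular read-through cache in front of a flat file. The names (BlockCache, read_many) and the 4 KiB block size are illustrative choices, not Nvidia’s SCADA API: many fine-grained requests are deduplicated into one sorted batch of block reads, and subsequent requests are served from cache.

```python
# Illustrative only: BlockCache is a hypothetical stand-in for SCADA's
# GPU-resident read-through cache; it is not Nvidia's API.
from collections import OrderedDict

BLOCK = 4096  # assume 4 KiB cache blocks

class BlockCache:
    """LRU read-through cache keyed by block number."""
    def __init__(self, path, capacity=1024):
        self.f = open(path, "rb")
        self.capacity = capacity
        self.blocks = OrderedDict()  # block number -> bytes

    def read_many(self, offsets, size=16):
        """Serve many small random reads, fetching misses as one sorted batch."""
        wanted = {off // BLOCK for off in offsets}
        misses = sorted(b for b in wanted if b not in self.blocks)
        for b in misses:                          # one batched pass, not N seeks
            self.f.seek(b * BLOCK)
            self.blocks[b] = self.f.read(BLOCK)
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)   # evict least recently used
        out = []
        for off in offsets:                       # every request is now a hit
            b, r = divmod(off, BLOCK)
            self.blocks.move_to_end(b)
            out.append(self.blocks[b][r:r + size])
        return out

# cache = BlockCache("data.bin")            # hypothetical data file
# records = cache.read_many([40960, 81920, 40976])
```

In SCADA the equivalent batching happens inside the GPU kernel across threads, and a miss batch goes over PCIe to NVMe or a SCADA server rather than through a file read on the host.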

There are a couple of places during LLM inferencing where small, random reads have to happen repeatedly:

  1. KV cache lookups. As an LLM such as ChatGPT builds its response to a question word by word, the model needs to reference all the previous words in the conversation to decide what comes next. It doesn’t recompute everything from scratch; instead, it looks up cached intermediate results (the key and value vectors) from earlier in the conversation. These lookups involve many small reads from random places each time a new word is generated.
  2. Vector similarity search. When you upload a document to the LLM, the document gets broken into chunks, and each chunk is turned into a vector and stored in a vector index. When you then ask a question, it’s also turned into a vector, and the vector database searches the index to find the most similar chunks, a process that requires comparing the query vector against many small vectors stored at unpredictable locations. Sketches of both access patterns follow this list.
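
A rough feel for pattern 1, as a Python sketch assuming a paged KV cache of the kind serving engines use (the page size, pool layout, and function names here are invented for illustration): key vectors for past tokens sit in fixed-size pages scattered across a pool, so generating each new token triggers a growing number of small, scattered reads.

```python
# Toy paged KV cache; layout and sizes are illustrative assumptions.
import numpy as np

DIM, PAGE, SLOTS = 64, 16, 1024          # head dim, tokens per page, pool size
pool = np.zeros((SLOTS, PAGE, DIM), np.float32)
page_table = []                          # logical page -> pool slot
free_slots = list(np.random.permutation(SLOTS))

def append_token(pos, key):
    if pos % PAGE == 0:                  # a new logical page lands anywhere
        page_table.append(free_slots.pop())
    pool[page_table[pos // PAGE], pos % PAGE] = key

def gather_keys(n):
    """Attention for token n reads every earlier key: n small scattered reads."""
    return np.stack([pool[page_table[p // PAGE], p % PAGE] for p in range(n)])

append_token(0, np.random.rand(DIM).astype(np.float32))
for pos in range(1, 100):
    keys = gather_keys(pos)              # pos scattered lookups this step
    append_token(pos, np.random.rand(DIM).astype(np.float32))
print(keys.shape)                        # (99, 64): reads grow with context
```

And pattern 2, as a brute-force sketch: chunk vectors live in an on-disk index (a numpy memmap standing in for a vector store; the file name, sizes, and random candidate set are assumptions), and answering a query means fetching many small vectors from unpredictable offsets and scoring them against the query.

```python
# Toy vector similarity search over an on-disk index; names/sizes assumed.
import numpy as np

DIM, N = 128, 10_000
vecs = np.random.rand(N, DIM).astype(np.float32)      # build a toy index once
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
vecs.tofile("index.f32")

index = np.memmap("index.f32", dtype=np.float32, mode="r", shape=(N, DIM))
query = vecs[42]                                      # already unit length

cands = np.random.choice(N, 256, replace=False)       # an ANN would pick these
fetched = np.asarray(index[cands])                    # 256 small random reads
scores = fetched @ query                              # cosine (unit vectors)
print("top-5 chunk ids:", cands[np.argsort(scores)[-5:][::-1]])
```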

Just as GPUDirect Storage has become essential for efficient bulk data loading during training, SCADA is likely to become an essential part of efficient inferencing in the presence of a lot of context, as is the case when using both RAG and reasoning tokens.

[Thanks to a Glenn Lockwood blog post.]

More information: SCADA is a client-server scheme for getting data from a storage server, running SCADA server software, to a GPU server, running SCADA client software, and it provides fine-grained, accelerated, GPU-initiated access to stored data. It is a specialized technology framework developed by Nvidia as part of its CUDA ecosystem and was introduced around 2024. It addresses the challenges of handling massive datasets in GPU-accelerated computing environments, particularly for applications whose data volumes exceed available memory.

Nvidia notes that feeding large data to GPUs requires support for 100,000+ fine-grained GPU accesses to datasets that no longer fit in memory, as well as securing accesses to the GPU. New applications (GNNs, vector databases) make fine-grained requests from every GPU thread to more data than can fit in the memory of many nodes. The SCADA programming model avoids painful out-of-memory errors with load/store semantics and leverages NVMe to reduce total cost of ownership.
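
The load/store point can be illustrated by analogy with memory mapping, where code addresses a file as if it were an in-memory array and pages are faulted in only when touched. This is a sketch of the programming model’s feel, with numpy’s memmap standing in for SCADA’s demand-paged access, not SCADA itself (the file name and sizes are invented):

```python
# Analogy only: memmap plays the role of SCADA's demand-paged access.
import numpy as np

# Create a toy file once; in practice this would be far larger than RAM.
np.arange(1_000_000, dtype=np.float32).tofile("embeddings.f32")

big = np.memmap("embeddings.f32", dtype=np.float32, mode="r")
row = big[765_432:765_432 + 128]   # a "load": only the touched pages are read
print(float(row.mean()))           # no OOM even if the file dwarfed memory
```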

SCADA enables GPUs to directly and efficiently access large-scale datasets from storage without relying on CPU intermediaries, which traditionally introduce bottlenecks and overhead. It uses GPUDirect Storage extensions to allow up to 100,000 fine-grained GPU threads to pull data directly from storage, bypassing CPU involvement for “speed-of-light” performance. It also automatically scales data and compute resources, making it ideal for scenarios where datasets are too large to fit in GPU memory.

SCADA also provides a single, unified API for data access that works seamlessly regardless of dataset size or compute cluster scale. This allows users to handle everything from single-node setups (e.g., 10 TB datasets) to distributed clusters without out-of-memory (OOM) errors or major code changes.
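
A hypothetical facade shows what “one API at any scale” can look like in spirit: callers index the dataset identically whether it fits in memory or must be demand-paged from disk. The function name, threshold, and backends below are illustrative assumptions, not SCADA’s actual interface:

```python
# Hypothetical unified-access facade; not Nvidia's SCADA API.
import os
import numpy as np

def open_dataset(path, dtype=np.float32, in_mem_limit=1 << 30):
    """Small files load fully; big ones are demand-paged. Same interface."""
    if os.path.getsize(path) <= in_mem_limit:
        return np.fromfile(path, dtype=dtype)       # fits: plain in-memory array
    return np.memmap(path, dtype=dtype, mode="r")   # too big: paged on demand

# data = open_dataset("embeddings.f32")   # hypothetical file
# chunk = data[1000:1128]                 # identical call either way, no OOM
```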

Nvidia graphic.

The data path protocol is implemented with DMA over PCIe or RDMA over InfiniBand or Ethernet. The control path protocol is implemented with secure IPC and/or RDMA. There is also a new GPU-oriented proprietary protocol that takes advantage of GPU parallelism and reduces the number of ‘doorbell rings.’
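
The ‘doorbell’ remark refers to the notification write that tells a device its submission queue has new entries; ringing once per request squanders the queue’s parallelism. Here is a minimal Python sketch of the batching idea, with all names invented (the real SCADA protocol is proprietary):

```python
# Illustrative submission queue; SubmissionQueue and its methods are invented.
class SubmissionQueue:
    def __init__(self):
        self.ring, self.doorbell_rings = [], 0

    def submit(self, request):
        self.ring.append(request)         # queued, but no doorbell yet

    def ring_doorbell(self):
        self.doorbell_rings += 1          # one notification covers the batch
        batch, self.ring = self.ring, []
        return batch                      # "device" consumes the whole batch

sq = SubmissionQueue()
for off in range(0, 64 * 4096, 4096):
    sq.submit(("read", off, 4096))        # 64 requests queued by GPU threads
done = sq.ring_doorbell()                 # a single doorbell ring for all 64
print(len(done), "requests,", sq.doorbell_rings, "doorbell ring(s)")
```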