The Nvidia SCADA scheme is ushering in GPU-controlled storage IO for AI inferencing workloads, promising faster small-block transfers than GPUDirect.
SCADA is an Nvidia term within its “Storage-Next” architecture. It stands for Scaled Accelerated Data Access and denotes a storage data IO scheme in which the GPUs in a GPU server directly initiate and control storage IO. This contrasts with GPUDirect, Nvidia’s existing protocol for speeding up storage IO. Originally, GPUs were treated as ancillary accelerators by an x86 host server, which controlled the flow of data to and from them and owned both the control path and the data path for the IO. GPUDirect took the data path away from the x86 CPU and enabled direct GPU memory-to-storage data transfer using RDMA to NVMe drives, but the CPU still owned the control path. SCADA takes the control path away from the CPU as well.
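The progression of path ownership described above can be summarized in a small sketch (a reading of this description, not an Nvidia specification):

```python
# Conceptual summary of who owns each IO path in the three models described
# above. This is an illustration of the article's description, not a spec.
io_models = {
    "classic":   {"control_path": "CPU", "data_path": "CPU (bounce via host memory)"},
    "GPUDirect": {"control_path": "CPU", "data_path": "GPU (RDMA to NVMe)"},
    "SCADA":     {"control_path": "GPU", "data_path": "GPU"},
}

for name, paths in io_models.items():
    print(f"{name:>9}: control={paths['control_path']}, data={paths['data_path']}")
```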
AI training typically needs bulk data transfers, where the control path accounts for a comparatively small share of each transfer’s time. AI inferencing needs small-block IOs, less than 4 KB, where the control path time of each transfer is relatively large. Nvidia research found that having GPUs initiate such transfers would take less time and so speed up inferencing. SCADA is the result, and an Nvidia FMS 2025 paper, “Advancing Memory and Storage Architectures for Next-Gen AI Workloads,” discusses it.
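A back-of-the-envelope model shows why small blocks are control-path-bound. The latency and bandwidth figures below are illustrative assumptions, not Nvidia’s measurements:

```python
# Illustrative model: total IO time = fixed control-path latency + transfer
# time. All numbers here are assumptions chosen to show the shape of the
# argument, not measured values.

def io_time_us(block_bytes, control_us, bus_gb_per_s=64.0):
    """Total IO time in microseconds.
    bus_gb_per_s is an assumed PCIe Gen 5 x16 effective throughput."""
    transfer_us = block_bytes / (bus_gb_per_s * 1e3)  # bytes / (bytes/µs)
    return control_us + transfer_us

CPU_CONTROL_US = 10.0  # assumed CPU-initiated control-path overhead

for block in (4 * 1024, 1024 * 1024):  # 4 KB (inference) vs 1 MB (training)
    total = io_time_us(block, CPU_CONTROL_US)
    frac = CPU_CONTROL_US / total
    print(f"{block >> 10:>5} KB block: control path is {frac:.0%} of IO time")
```

Under these assumptions the control path dominates a 4 KB transfer but is a minority of a 1 MB transfer, which is why shaving control-path latency pays off mainly for inferencing-style IO.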
Nvidia is working with storage ecosystem partners to productize SSDs and controllers that use SCADA. Marvell makes SSD controllers, and a blog by Chander Chadha, its Director of Marketing for Flash Storage Products, says: “The AI infrastructure need is prompting storage companies to develop SSDs, controllers, NAND and other technologies fine-tuned to support GPUs – with an emphasis on higher IOPS (input/output operations per second) for AI inference – that will be fundamentally different from those for CPU-connected drives where latency and capacity are the bigger focus points.”
Chadha says: “The GPU initiates storage transactions within the SCADA framework which is built around memory semantics,” meaning load and store requests to which the SSD controller has to respond.
He says current SSDs cannot respond fast enough, in IOPS terms, “for data sets smaller than 4KB which results in an underutilized PCIe bus, leading to the GPU starving for data and wasting cycles.” The GPUs could need such data to sustain more than 1,000 parallel threads in inferencing workloads. AI training with CPU-initiated transfers needs fewer. Chadha says: “The number of GPU parallel threads is much lower – tens versus thousands – and data sets are larger in size.”
Faster PCIe buses, such as PCIe 6 and 7, will help, but SSD controllers also need updating with SCADA accelerator functions and “optimal error correction schemes for smaller payloads.”
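Rough arithmetic shows why bus generation matters at these IOPS rates. The figures below are approximate raw x16 link rates and ignore protocol overhead; they are assumptions for illustration:

```python
# Rough arithmetic (approximate raw rates, protocol overhead ignored):
# how many PCIe x16 links a given small-block IOPS rate would saturate.
GEN_X16_GB_PER_S = {5: 64.0, 6: 128.0, 7: 256.0}  # approx. raw GB/s per x16 link

def links_needed(iops, block_bytes, gen):
    """x16 links of the given PCIe generation needed to carry the payload."""
    needed_gb_per_s = iops * block_bytes / 1e9
    return needed_gb_per_s / GEN_X16_GB_PER_S[gen]

# 230 million 4 KB reads per second is roughly 942 GB/s of payload traffic.
for gen in (5, 6, 7):
    print(f"PCIe Gen {gen}: {links_needed(230e6, 4096, gen):.1f} x16 links")
```

Each generation roughly halves the link count needed for the same small-block IOPS rate, which is why Gen 6 and Gen 7 matter for GPU-initiated small-block IO.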
Chadha sees SSDs emerging with controllers that can handle both types of workload, “capable of handling both PCIe and Ethernet traffic.” We should also, he says, “expect to see future work on interfacing with high bandwidth flash memory or CXL networks.”
Micron
NAND and SSD supplier Micron is also active in SCADA development. It has a PCIe Gen 6 SSD, the 9650, with “optimization for small-block operations.” The 7.68 TB model delivers up to 5.4 million random read IOPS. Micron demonstrated 44 of them delivering 230 million IOPS using the SCADA programming model at SC25.
The setup connected these SSDs to Broadcom PEX90000 PCIe Gen 6 switches inside an H3 Platform Falcon 6048 PCIe Gen 6 server, which contained three Nvidia H100 PCIe Gen 5 GPUs.
Micron says the system “demonstrates linear scaling from 1 to 44 SSDs.” The demo’s 230 million peak IOPS figure is quite close to the theoretical maximum of 237.6 million, the aggregate of 44 drives each delivering 5.4 million random read IOPS.
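The scaling claim reconciles with simple arithmetic (our check, not Micron’s methodology):

```python
# Sanity-check the scaling claim: 44 SSDs at 5.4M random-read IOPS each,
# versus the 230M IOPS Micron demonstrated at SC25.
per_drive_iops = 5.4e6
drives = 44
theoretical = per_drive_iops * drives  # 237.6 million IOPS
measured = 230e6
efficiency = measured / theoretical

print(f"theoretical: {theoretical / 1e6:.1f}M IOPS, efficiency: {efficiency:.1%}")
# -> theoretical: 237.6M IOPS, efficiency: 96.8%
```

Roughly 97 percent of the theoretical aggregate is consistent with the near-linear scaling Micron describes.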
It concludes: “Combined with PCIe Gen6, high-performance SSDs, this [SCADA] architecture enables real-time data access for workloads like vector databases, graph neural networks and large-scale inference pipelines.”
Bootnote
The SCADA acronym has traditionally stood for Supervisory Control and Data Acquisition in the industrial telemetry world. Nvidia’s usage is different but analogous.