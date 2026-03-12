Lightbits Labs and ScaleFlux have produced a 100x to 280x speed up of KV cache workloads using LightInferra cache SW reading data off ScaleFlux computational storage SSDs.

The two fed KV cache data to GPUs hosted in a FarmGPU datacenter and are showing this at Nvidia’s GTC conference next week. A KV cache holds token vectors in a GPU’s high-bandwidth memory (HBM). When the HBM fills up, KV cache data blocks are evicted and have to be recomputed when they are next needed. That takes time and slows AI training and inference, particularly as the number of tokens, from which the vectors are generated, grows with the scale of the AI jobs.

KV Cache software logically extends the cache to the GPU server’s embedded x86 CPU and its DRAM, then out to the local NVMe drives on that x86 system, and further out to external NVMe SSDs, to avoid the token vector recompute burden. The NVMe SSDs are much slower to access than HBM or DRAM, but fetching already-computed token vectors is faster than recomputing tens of thousands of them. Lightbits and ScaleFlux say they can dramatically speed up KV Cache data fetching from the SSDs.
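The tiering described above can be pictured as a lookup that walks from the fastest tier to the slowest and only recomputes on a complete miss. What follows is a minimal sketch under stated assumptions, not Lightbits or ScaleFlux code: the tier names, the `TieredKVCache` class and the `recompute` callback are all hypothetical, capacities are counted in blocks, and real KV Cache software moves tensors rather than Python objects.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict, Hashable, Tuple

# Hypothetical tier names, ordered fastest to slowest (illustrative only).
TIERS = ["hbm", "dram", "local_nvme", "remote_nvme"]


class TieredKVCache:
    """Toy model of a KV cache that spills from HBM to slower tiers."""

    def __init__(self, capacities: Dict[str, int]):
        # One LRU-ordered dict per tier; capacities are in blocks.
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in TIERS}

    def put(self, key: Hashable, block: Any, tier_index: int = 0) -> None:
        """Insert into a tier, demoting the least recently used block on overflow."""
        name = TIERS[tier_index]
        store = self.tiers[name]
        store[key] = block
        store.move_to_end(key)
        if len(store) > self.capacities[name]:
            old_key, old_block = store.popitem(last=False)
            if tier_index + 1 < len(TIERS):
                self.put(old_key, old_block, tier_index + 1)
            # Otherwise the block falls off the last tier and will
            # have to be recomputed if it is requested again.

    def get(self, key: Hashable, recompute: Callable[[Hashable], Any]) -> Tuple[Any, str]:
        """Return (block, source); source is a tier name or 'recompute'."""
        for name in TIERS:
            store = self.tiers[name]
            if key in store:
                block = store.pop(key)
                self.put(key, block, 0)  # promote the block back into HBM
                return block, name
        block = recompute(key)  # complete miss: the expensive path
        self.put(key, block, 0)
        return block, "recompute"
```

With one block per tier, inserting three blocks pushes the oldest down to `local_nvme`; a later `get` finds it there, promotes it to `hbm`, and never calls `recompute`.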

Arthur Rasmusson.

Lightbits Labs’ Arthur Rasmusson, Director of AI Architecture, said: “We’re transforming inference memory from a reactive cache into an intelligent, streamed data layer.”

How?

“By prefetching only the data that matters and delivering it to GPUs over high-speed RDMA before it's needed, we eliminate the stalls that traditionally limit long-context performance. The result is lower Time-to-First-Token (TTFT), more stable throughput under real-world load, and significantly higher effective GPU utilization.”

Keith McKay, Senior Director of Solutions Architecture and Technical Partnerships at ScaleFlux, said: “What we’re showing at GTC is an early look at how smarter data placement and persistent attention state management could help inference systems stay responsive as context windows grow. This is very much a collaboration we want to shape alongside real operators.”

Keith McKay.

Both Lightbits and ScaleFlux want neocloud operators to consider buying their SW and SSDs to avoid wasted GPU idle time.

Let’s look at what ScaleFlux contributes first and then turn to the more complicated Lightbits SW.

ScaleFlux supplies NVMe SSDs, Computational Storage Drives (CSDs), with HW-based Write Reduction Technology (WRT). This is based on HW-accelerated compression and metadata management using SoCs (Systems-on-Chip), which provide up to four times more logical than physical capacity and are transparent to hosts. ScaleFlux is a member of the Open Flash Platform (OFP) group, which wants to replace data servers by providing massive-scale AI hardware/software systems with ten times the density of existing file-based AI storage, lower latency, and a tenth of the power use.

What Lightbits provides on top of these drives is a way of prefetching KV Cache data before it is needed by the GPUs, so they don’t have to wait for external storage IO or token vector recomputation when their KV cache runs out of capacity. Its LightInferra SW uses cache algorithms specific to the KV Cache workload to get the data needed by the GPUs into their memory, at RDMA speed, before they actually need it.

Again, how?

It runs in a GPU server’s embedded x86 system and monitors KV Cache data block activity. With this data it runs a Sub-Linear Sparse Attention Prefetch (SLSAP) engine to identify KV blocks most likely to be needed next. The engine uses techniques like locality-sensitive hashing (LSH) combined with statistical reuse patterns (observing historical access locality in attention computations) to score and rank KV blocks, and then identify and select the ones most likely to be needed next by the GPUs.
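The internals of the SLSAP engine are not public, so the following is only a generic illustration of how locality-sensitive hashing can score and rank KV blocks. It is a minimal sketch that assumes each block is summarized by a single vector and uses random-hyperplane hashing (SimHash); the function names, the `block_summaries` mapping and the choice of Hamming similarity as the score are all hypothetical.

```python
import random
from typing import Dict, List, Sequence


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw n_bits random hyperplanes; each one contributes one signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec: Sequence[float], planes: List[List[float]]) -> int:
    """SimHash signature: one bit per hyperplane, set when the dot product is >= 0."""
    bits = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def rank_blocks(
    query: Sequence[float],
    block_summaries: Dict[str, Sequence[float]],
    planes: List[List[float]],
    top_k: int,
) -> List[str]:
    """Return the top_k block ids whose signatures are closest to the query's."""
    n_bits = len(planes)
    q_sig = signature(query, planes)
    scored = []
    for block_id, summary in block_summaries.items():
        hamming = bin(q_sig ^ signature(summary, planes)).count("1")
        scored.append((n_bits - hamming, block_id))  # more matching bits = higher score
    # Highest score first; ties broken by block id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [block_id for _, block_id in scored[:top_k]]
```

In a real system the block signatures would be computed once and stored, so each ranking pass costs one bitwise comparison per block and no vector arithmetic. This sketch recomputes them on every call to keep the code short.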

We understand that this identification and block selection exploits sparsity in the way a GPU uses data blocks in its work. Most tokens only attend meaningfully to a small subset of prior tokens, and identifying and selecting these high-probability blocks dramatically shrinks the number of token vectors you need to pump back to the GPUs.

A second algorithm is based on understanding that recent tokens, semantically similar tokens, and certain structural patterns (e.g., in RAG or multi-turn chat) tend to be reused frequently.
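One common way to capture "recent and frequently reused" in a single number is an exponentially decayed access counter. The sketch below is an assumption about how such a heuristic might look, not a description of LightInferra: the `ReuseTracker` class, its half-life parameter and the notion of a decode "step" are hypothetical.

```python
import math
from typing import Dict, Hashable, List


class ReuseTracker:
    """Scores blocks by access count, with older accesses decaying over time."""

    def __init__(self, half_life_steps: float = 64.0):
        # After half_life_steps without access, a block's score halves.
        self.decay = math.log(2) / half_life_steps
        self.score: Dict[Hashable, float] = {}
        self.last_step: Dict[Hashable, int] = {}

    def current_score(self, block_id: Hashable, step: int) -> float:
        """Score of a block as seen at the given step (0.0 if never accessed)."""
        if block_id not in self.last_step:
            return 0.0
        age = step - self.last_step[block_id]
        return self.score[block_id] * math.exp(-self.decay * age)

    def record_access(self, block_id: Hashable, step: int) -> None:
        """Decay the old score to the current step, then add one for this access."""
        self.score[block_id] = self.current_score(block_id, step) + 1.0
        self.last_step[block_id] = step

    def hottest(self, step: int, k: int) -> List[Hashable]:
        """Return the k blocks with the highest decayed score at this step."""
        ranked = sorted(
            self.last_step,
            key=lambda b: (-self.current_score(b, step), str(b)),
        )
        return ranked[:k]
```

A block touched three times in the last few steps outranks one touched once just now, and both outrank a block last touched long ago, which matches the intuition in the paragraph above.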

LightInferra fetches these token blocks from the x86 server’s DRAM or, if not there, from the external ScaleFlux SSDs, and pre-loads them into the GPU’s HBM over RDMA links.
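The fetch order in that description, DRAM first and SSD only on a DRAM miss, can be sketched as follows. This is a toy model in which plain dictionaries stand in for HBM, host DRAM and the external SSDs; the `prefetch` function and its return value are hypothetical, and the RDMA transfer itself is reduced to a dictionary assignment.

```python
from typing import Any, Dict, Hashable, Iterable


def prefetch(
    block_ids: Iterable[Hashable],
    hbm: Dict[Hashable, Any],
    dram: Dict[Hashable, Any],
    ssd: Dict[Hashable, Any],
) -> Dict[Hashable, str]:
    """Copy the requested blocks into hbm and report where each one came from."""
    sources: Dict[Hashable, str] = {}
    for block_id in block_ids:
        if block_id in hbm:
            sources[block_id] = "hbm"  # already resident, nothing to do
        elif block_id in dram:
            hbm[block_id] = dram[block_id]  # stands in for a DRAM-to-HBM transfer
            sources[block_id] = "dram"
        elif block_id in ssd:
            hbm[block_id] = ssd[block_id]  # stands in for an SSD read plus transfer
            sources[block_id] = "ssd"
        else:
            sources[block_id] = "missing"  # would have to be recomputed
    return sources
```

The list of `block_ids` would come from whatever ranking step selects the blocks most likely to be needed next.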

Lightbits has run benchmarks using large language model workloads to measure how much LightInferra lowers time to first token (TTFT), compared to regenerating the cache’s previous contents. The 100x to 280x acceleration numbers come from this table.

Lightbits LightInferra-ScaleFlux KV Cache benchmark table.

Of course, we’d love to see benchmark results comparing the Lightbits-ScaleFlux KV Cache acceleration scheme with KV Cache accelerators from DDN, Hammerspace, VAST Data, WEKA and others, but they are not available.

There are charts showing how LightInferra-ScaleFlux progressively improves on cache regeneration TTFT as the model size increases. For example:

Lightbits LightInferra-ScaleFlux KV Cache vs Cache-Regen as model size increases.

These are all log-scale charts, designed for computer science folks, but text makes the situation more comprehensible: “The outcome is sustained TTFT performance as context scales from 100k tokens toward 1 million and beyond. As Jonmichael Hands of FarmGPU puts it, when a 400k-token conversation resumes and the system regenerates the entire KV cache from scratch, that is two minutes of GPU time producing zero tokens. LightInferra changes the economics completely: the same workload completes its first token in under half a second, turning a non-viable product tier into a profitable one.”
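The figures in that quote can be checked against the headline range with a line of arithmetic. The sketch below takes "two minutes" as 120 seconds and treats "under half a second" as an upper bound of 0.5 seconds, so the result is a lower bound on the speedup for that one 400k-token example; the variable names are illustrative.

```python
# Figures taken from the quote above; 0.5 s is an upper bound on the cached TTFT.
regen_seconds = 2 * 60        # regenerating the KV cache from scratch
cached_ttft_seconds = 0.5     # first token with the cache fetched, not rebuilt

speedup = regen_seconds / cached_ttft_seconds
print(f"at least {speedup:.0f}x")  # prints "at least 240x"
```

A floor of 240x for this example sits inside the 100x to 280x range quoted from the benchmark table.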

Jonmichael Hands.

Lightbits and ScaleFlux have neocloud GPU farms in mind, where GPU pods will run hundreds or thousands of simultaneous AI model workloads, each of them running out of HBM KV Cache capacity. The pitch is that LightInferra-ScaleFlux rescues them from hours and hours of GPU idle time while token vectors are being fetched from ordinary external storage or, worse still, regenerated.

FarmGPU CEO Jonmichael Hands said: “Fast networked storage from Lightbits unlocks a lot of new use cases for long context inference. By pairing our managed service with Lightbits’ high-performance storage running on ScaleFlux NVMe, we are able to lower time to first token and increase utilization on GPUs, drastically lowering the TCO for inference.”

LightInferra video screen grab.

Read a Lightbits solution brief doc here and watch a LightInferra video here.