AI/ML

Pinecone rolls out dedicated read nodes to boost vector search performance

Pinecone says its dedicated read nodes offer predictable performance and cost for vector searches at billion-vector scale.

The company supplies a serverless, managed vector database service hosted in the AWS, Azure, or Google clouds, with a focus on performance and pay-as-you-go billing. Vectors are mathematical representations of word, image, audio, and video chunks; AI large language models search a vector database for relevant vectors when generating responses to queries. Pinecone says that, as workloads grow, most vector databases hit limits: searches spanning a billion or more vectors can become expensive and slow to complete.
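
To make the idea concrete, here is a minimal, brute-force sketch of what a vector query does conceptually: embed items and a query as vectors, then rank the stored vectors by similarity. The embeddings below are invented, and a production system such as Pinecone uses purpose-built approximate indexes rather than a linear scan.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: 1.0 means the two vectors point the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came from an embedding model (values are made up).
index = {
    "doc-1": np.array([0.9, 0.1, 0.0]),
    "doc-2": np.array([0.2, 0.8, 0.1]),
    "doc-3": np.array([0.4, 0.4, 0.5]),
}
query = np.array([0.85, 0.15, 0.05])

# Rank stored vectors by similarity to the query, best match first.
ranked = sorted(index, key=lambda k: cosine_similarity(query, index[k]), reverse=True)
for doc_id in ranked:
    print(doc_id, round(cosine_similarity(query, index[doc_id]), 3))
```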

Pinecone customers are charged according to four plans: starter, standard, enterprise, or dedicated. They are billed for storage, write units, read units, backups, data imports and more on a per-request, pay-as-you-go basis beyond the minimums. Dedicated read nodes (DRNs) provide more predictable pricing and performance for high-throughput and spiky workloads, with a lower cost per query than with per-request pricing.

A read unit (RU), Pinecone’s AI chatbot says, “is a billing metric for serverless indexes that measures the compute, I/O, and network resources consumed by read operations.” RUs apply to queries, fetches, and lists. As an example, a query uses 1 RU for every 1 GB of namespace, with a minimum of 0.25 RU per query. RUs are free on the starter plan, cost $16 per million on the standard plan and $24 per million on the enterprise plan, and are custom-priced on the dedicated plan.
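
Putting those numbers together, a rough back-of-the-envelope read cost looks like the sketch below; the namespace size and query volume are hypothetical.

```python
# Read-unit cost estimate using the figures quoted above: 1 RU per GB of
# namespace per query, 0.25 RU minimum, $16 per million RUs on the
# standard plan. The workload numbers are made up.
namespace_gb = 4                    # size of the namespace a query touches
queries_per_month = 50_000_000

rus_per_query = max(namespace_gb * 1.0, 0.25)   # 1 RU per GB, 0.25 RU floor
total_rus = rus_per_query * queries_per_month
standard_cost = total_rus / 1_000_000 * 16      # $16 per million RUs

print(f"{rus_per_query:.2f} RUs per query, {total_rus:,.0f} RUs per month")
print(f"Standard plan read cost: ${standard_cost:,.2f} per month")
```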

DRNs are charged per node, per hour. Pinecone says this is significantly more cost-effective than per-request pricing for sustained, high-QPS workloads and makes spend easier to forecast.
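
The point at which a node pays for itself depends on how busy it is kept. The sketch below shows the break-even arithmetic; the hourly node price is a placeholder rather than a published Pinecone figure, and plan minimums are ignored.

```python
# Hypothetical break-even between per-request RU billing and per-node
# hourly billing. node_price_per_hour is a placeholder, not a list price.
ru_price_per_million = 16.0      # standard plan, per the article
rus_per_query = 4.0              # e.g. queries touching a 4 GB namespace
node_price_per_hour = 2.0        # placeholder hourly DRN price
hours_per_month = 730

node_cost = node_price_per_hour * hours_per_month
cost_per_query = rus_per_query * ru_price_per_million / 1_000_000

# Sustained QPS at which a month of per-request reads costs as much as one node.
break_even_qps = node_cost / (cost_per_query * 3600 * hours_per_month)
print(f"Break-even at roughly {break_even_qps:.1f} sustained QPS")
```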

Pinecone DRN diagram

Standard reads run on resources shared with other users behind the scenes, which can cause noisy-neighbor and resource-contention issues. Pinecone says that because DRNs are isolated and keep a so-called warm data path – memory plus a local SSD – they deliver predictably low latency and high throughput under heavy load, with no noisy neighbors or shared queues. Keeping data on local SSDs also avoids latency-lengthening cold fetches.

DRNs scale in two ways: adding replicas increases query throughput, while adding shards increases capacity. Setting them up takes little to no effort, as Pinecone handles data movement and scaling behind the scenes.
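
For a sense of how the two dimensions combine, the arithmetic below sizes a hypothetical deployment; the per-shard capacity and per-replica throughput figures are invented for illustration, since Pinecone sizes and rebalances the nodes itself.

```python
import math

total_vectors = 1_000_000_000
vectors_per_shard = 150_000_000   # hypothetical capacity of one shard
target_qps = 2_000
qps_per_replica = 600             # hypothetical throughput of one replica

shards = math.ceil(total_vectors / vectors_per_shard)   # scale capacity
replicas = math.ceil(target_qps / qps_per_replica)      # scale throughput
print(f"{shards} shards x {replicas} replicas = {shards * replicas} read nodes")
```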

Pinecone quotes latency percentiles to characterize performance. P50 is the median latency, with 50 percent of queries completing faster, and P99 is the tail latency, with 99 percent of queries completing faster. These are typically measured for query operations on dense or sparse indexes and include network overhead.
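
Percentiles like these are read straight off a distribution of measured query times, as in this small sketch with synthetic latency samples.

```python
import numpy as np

rng = np.random.default_rng(0)
latencies_ms = rng.gamma(shape=4.0, scale=12.0, size=10_000)  # synthetic samples

p50 = np.percentile(latencies_ms, 50)   # median: half of queries are faster
p99 = np.percentile(latencies_ms, 99)   # tail: 99 percent of queries are faster
print(f"P50 = {p50:.1f} ms, P99 = {p99:.1f} ms")
```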

It cites one customer using DRNs to power metadata-filtered, real-time media search in their design platform. Across 135 million vectors, they sustain 600 queries per second (QPS) with a P50 latency of 45 ms and a P99 of 96 ms in production. That same customer ran a load test by scaling their DRN nodes and reached 2,200 QPS with a P50 latency of 60 ms and a P99 of 99 ms.

Dedicated read nodes are now available in public preview. The main use cases are semantic search at billion-vector scale with strict latency requirements, high-QPS recommendation systems, mission-critical AI services with hard SLOs, and large enterprise or multi-tenant platforms that need performance isolation to stop one heavy workload from degrading another.

Read more on Pinecone’s DRNs here and in a blog.