Caching¶
DeltaLLM can cache responses to repeated requests to reduce latency and provider cost.
Quick Success Path¶
- Turn caching on
- Start with the in-memory backend
- Send the same request twice
- Check `x-deltallm-cache-hit` in the response headers
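The last two steps can be sketched in a few lines of Python. This is a minimal illustration, not an official client: the base URL `http://localhost:4000`, the placeholder API key, and bearer-token authentication are assumptions to replace with the values for your deployment.

```python
import json
import urllib.request

BASE_URL = "http://localhost:4000"  # assumption: replace with your DeltaLLM address
API_KEY = "sk-example"              # assumption: replace with a real key


def is_cache_hit(headers) -> bool:
    """Return True when x-deltallm-cache-hit is present and equal to 'true'."""
    value = headers.get("x-deltallm-cache-hit")
    return value is not None and value.strip().lower() == "true"


def send_chat(payload: dict):
    """POST the payload to /v1/chat/completions and return the response headers."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",  # assumption: bearer-token auth
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.headers


def check_repeat_request() -> None:
    """Send an identical request twice and report the cache header each time."""
    payload = {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello"}],
    }
    for attempt in (1, 2):
        headers = send_chat(payload)
        print(f"attempt {attempt}: cache hit = {is_cache_hit(headers)}")
```

Calling `check_repeat_request()` against a running instance with caching enabled should report a hit on the second attempt, provided nothing else in the request changes between the two calls.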
What Gets Cached¶
The cache middleware is currently applied to these POST endpoints:
- `/v1/chat/completions`
- `/v1/completions`
- `/v1/responses`
- `/v1/embeddings`
Streaming cache replay is currently supported only for /v1/chat/completions.
Choose a Backend¶
| Backend | Best for | Notes |
|---|---|---|
| `memory` | Local development or a single instance | No external dependency |
| `redis` | Shared cache across multiple instances | Best production default |
| `s3` | Long-lived object-backed cache | Use when you need object storage instead of RAM or Redis |
Memory¶
Redis¶
S3¶
When you use s3, also configure the S3 callback settings under deltallm_settings.callback_settings.s3.
How Cache Keys Work¶
DeltaLLM builds cache keys from:
- the full request payload
- the target model
- relevant request parameters
- an optional custom cache key
- the authenticated request scope
This means two different API keys do not share cache entries by default.
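To illustrate why the request scope keeps entries separate, here is a hypothetical key derivation. It is not DeltaLLM's actual algorithm: the function name `build_cache_key`, the SHA-256 hash, the canonical JSON encoding, and the way the custom key is mixed in are all assumptions made for the sketch.

```python
import hashlib
import json
from typing import Optional


def build_cache_key(payload: dict, scope: str, custom_key: Optional[str] = None) -> str:
    """Hash a canonical form of the request together with the caller's scope.

    Illustrative only: the real key format is not documented here.
    """
    material = {
        "scope": scope,            # hypothetical identifier tied to the API key
        "custom_key": custom_key,  # optional logical key supplied by the caller
        "payload": payload,        # carries the model and request parameters
    }
    # Sorting keys makes the encoding independent of field order in the payload.
    canonical = json.dumps(material, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this sketch, the same payload sent with two different scopes produces two different keys, which matches the behavior described above: entries are not shared across API keys by default.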
Verify Cache Hits¶
Cached responses include:
| Header | Meaning |
|---|---|
| `x-deltallm-cache-hit` | `true` when the response came from cache |
| `x-deltallm-cache-key` | The cache key used for this response |
Control Caching Per Request¶
Disable or Relax Caching Through Metadata¶
```json
{
  "model": "gpt-4o-mini",
  "messages": [{"role": "user", "content": "Hello"}],
  "metadata": {
    "cache": "no-cache",
    "cache_ttl": 120,
    "cache_key": "my-shared-key"
  }
}
```
Supported request-level controls:
- `metadata.cache: false` skips cache lookup and cache write
- `metadata.cache: "no-cache"` skips lookup but allows a fresh write
- `metadata.cache: "no-store"` allows lookup but skips write
- `metadata.cache_ttl` overrides TTL
- `metadata.cache_key` provides a custom logical key
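The controls above reduce to two independent decisions, whether to look up and whether to write. The following sketch models that mapping; `CachePolicy`, `resolve_cache_policy`, and the 3600-second default TTL are hypothetical names and values chosen for illustration, not part of DeltaLLM.

```python
from typing import NamedTuple, Optional


class CachePolicy(NamedTuple):
    lookup: bool               # read from the cache before calling the provider
    write: bool                # store the fresh response afterwards
    ttl: int                   # lifetime of a written entry, in seconds
    custom_key: Optional[str]  # caller-supplied logical key, if any


def resolve_cache_policy(metadata: Optional[dict], default_ttl: int = 3600) -> CachePolicy:
    """Translate request metadata into lookup/write decisions."""
    metadata = metadata or {}
    mode = metadata.get("cache")
    lookup, write = True, True
    if mode is False:          # cache: false disables both sides
        lookup, write = False, False
    elif mode == "no-cache":   # skip lookup, still write the fresh response
        lookup = False
    elif mode == "no-store":   # allow lookup, never write
        write = False
    ttl = metadata.get("cache_ttl", default_ttl)
    return CachePolicy(lookup, write, ttl, metadata.get("cache_key"))
```

For the JSON example in this section, the sketch yields no lookup, a write with a 120-second TTL, and `my-shared-key` as the custom key.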
Use HTTP Headers¶
The cache middleware also reads:
- `Cache-Control: no-cache`
- `Cache-Control: no-store`
- `Cache-TTL: <seconds>`
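A small helper can assemble these headers on the client side. This is a sketch: the function name `cache_request_headers` and its validation rules are assumptions, and it only builds the three header forms listed above.

```python
from typing import Dict, Optional


def cache_request_headers(directive: Optional[str] = None,
                          ttl_seconds: Optional[int] = None) -> Dict[str, str]:
    """Build the cache-related request headers named in this section."""
    headers: Dict[str, str] = {}
    if directive is not None:
        # Only the two directives listed in this section are accepted here.
        if directive not in ("no-cache", "no-store"):
            raise ValueError(f"unsupported Cache-Control directive: {directive!r}")
        headers["Cache-Control"] = directive
    if ttl_seconds is not None:
        if ttl_seconds < 0:
            raise ValueError("ttl_seconds must be non-negative")
        headers["Cache-TTL"] = str(ttl_seconds)
    return headers
```

The returned dictionary can be merged into the headers of any HTTP client request, for example `cache_request_headers("no-store", 120)`.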
Advanced Notes¶
- Cache accounting still updates request, usage, and spend metrics on cache hits
- Budget enforcement and auth still happen before a cached response is returned
- If caching is disabled globally, request-level cache metadata has no effect