Routing & Failover¶

DeltaLLM can route one public model name to multiple deployments, retry failed calls, and fall back to another model group when needed.

For most teams, the easiest runtime workflow is:

Add two or more deployments for the same public model name in Models
Keep the default strategy first
Send traffic through the gateway
Check Route Groups, /health/deployments, and /health/fallback-events only when you need more control

Quick Path¶

If two deployments share the same model_name, DeltaLLM treats them as one routable group.

model_list:
  - model_name: gpt-4o-mini
    deployment_id: openai-primary
    deltallm_params:
      provider: openai
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      weight: 1

  - model_name: gpt-4o-mini
    deployment_id: openai-secondary
    deltallm_params:
      provider: openai
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY_2
    model_info:
      weight: 1

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 1
  retry_after: 1

With that in place, calls to gpt-4o-mini can be spread across both deployments and retried on failure.

How Routing Works¶

For each request, DeltaLLM:

Resolves the requested model name to a model group
Removes unhealthy or cooled-down deployments
Applies request tag filtering when metadata.tags is present
Optionally skips deployments already above configured RPM or TPM limits
Selects one deployment with the active routing strategy
Retries or falls back if the call fails in a retryable way

Pick A Strategy Fast¶

If you do not want to think too hard about routing on day one, use this:

simple-shuffle for most gateways
weighted when you want a planned traffic split such as 90/10
priority-based-routing when you want a primary deployment with a clear standby
least-busy when one deployment tends to get stuck with more in-flight work
rate-limit-aware when provider quotas are the main problem

Everything else is useful, but usually only after you have a specific reason.

Routing Strategies¶

Set the global default in router_settings.routing_strategy, or override it with a route-group policy in the Admin UI.

Strategy	What it does	Use it when
`simple-shuffle`	Randomly picks a healthy deployment	You want a low-maintenance default
`weighted`	Randomly picks by `model_info.weight`	You want a deliberate traffic split
`priority-based-routing`	Tries the lowest `priority` number first	You want primary and standby behavior
`least-busy`	Picks the deployment with the fewest active requests	Queue depth matters more than strict weighting
`latency-based-routing`	Prefers the best recent latency, while keeping new members eligible	Response time matters most
`cost-based-routing`	Prefers the lowest estimated unit cost for the request mode	You want the cheapest acceptable path
`usage-based-routing`	Prefers the deployment with the lowest current RPM/TPM utilization	You share quota across providers or keys
`rate-limit-aware`	Avoids deployments near configured RPM/TPM limits	You need to stay away from provider caps
`tag-based-routing`	Uses the same tag-filtered eligible pool as other strategies, then applies weighted choice	You want a route group that is explicitly tag-driven

Strategy Details¶

`simple-shuffle`¶

What it does: - Picks randomly from the healthy eligible pool - Ignores weight

Use it when: - Deployments are roughly equivalent - You want the simplest setup - You are starting out and want predictable behavior

Avoid it when: - You need a planned percentage split - One deployment should clearly be preferred over another

Setup:

router_settings:
  routing_strategy: simple-shuffle

`weighted`¶

What it does: - Picks randomly, but higher weight gets more traffic over time - Good for controlled rollout, canarying, or provider mix changes

Use it when: - You want 90/10, 80/20, or 50/50 - You want gradual migration from one deployment to another

Required deployment metadata: - model_info.weight

Setup:

model_list:
  - model_name: gpt-4o-mini
    deployment_id: primary
    deltallm_params: {provider: openai, model: openai/gpt-4o-mini}
    model_info: {weight: 9}
  - model_name: gpt-4o-mini
    deployment_id: canary
    deltallm_params: {provider: openai, model: openai/gpt-4o-mini}
    model_info: {weight: 1}

router_settings:
  routing_strategy: weighted

`priority-based-routing`¶

What it does: - Tries all priority 0 deployments first - Only falls through to higher numbers like 1 or 2 if the higher-priority pool is unavailable

Use it when: - You have a preferred primary provider - You want a warm standby deployment - Order matters more than spreading traffic

Required deployment metadata: - model_info.priority

Setup:

model_list:
  - model_name: gpt-4o-mini
    deployment_id: primary
    deltallm_params: {provider: openai, model: openai/gpt-4o-mini}
    model_info: {priority: 0}
  - model_name: gpt-4o-mini
    deployment_id: standby
    deltallm_params: {provider: openai, model: openai/gpt-4o-mini}
    model_info: {priority: 1}

router_settings:
  routing_strategy: priority-based-routing

`least-busy`¶

What it does: - Chooses the deployment with the fewest active in-flight requests - Useful when latency is mostly driven by queue depth

Use it when: - Deployments are similar, but traffic can clump - You want better balancing during bursts

Avoid it when: - You need explicit rollout percentages - Cost or provider quota is the main concern

Setup:

router_settings:
  routing_strategy: least-busy

`latency-based-routing`¶

What it does: - Uses recent observed latency to prefer faster deployments - New or unsampled deployments stay eligible instead of being starved forever

Use it when: - User-facing latency matters more than cost - Providers behave differently by region or load

Avoid it when: - You do not have enough steady traffic to build meaningful latency history

Setup:

router_settings:
  routing_strategy: latency-based-routing

`cost-based-routing`¶

What it does: - Estimates the cheapest eligible deployment for the current request mode - Works best when deployment pricing metadata is accurate

Use it when: - You have multiple providers for the same workload - Cost control matters more than tiny latency differences

Make sure you set pricing metadata that matches the workload: - token costs for chat, completions, embeddings, and rerank - image pricing for image generation - audio pricing for speech or transcription where applicable

Setup:

router_settings:
  routing_strategy: cost-based-routing

`usage-based-routing`¶

What it does: - Looks at recent request-per-minute and token-per-minute usage - Prefers the least utilized deployment

Use it when: - Several deployments share quota ceilings - You want to spread demand before you hit provider-side limits

Best with: - accurate rpm_limit and tpm_limit - steady traffic patterns - matching per-unit limits for non-text workloads, such as image_pm_limit, audio_seconds_pm_limit, char_pm_limit, or rerank_units_pm_limit

Setup:

model_list:
  - model_name: gpt-4o-mini
    deployment_id: east
    deltallm_params: {provider: openai, model: openai/gpt-4o-mini}
    model_info: {rpm_limit: 600, tpm_limit: 300000}
  - model_name: gpt-4o-mini
    deployment_id: west
    deltallm_params: {provider: openai, model: openai/gpt-4o-mini}
    model_info: {rpm_limit: 600, tpm_limit: 300000}

router_settings:
  routing_strategy: usage-based-routing

`rate-limit-aware`¶

What it does: - Filters out deployments already close to their configured RPM or TPM ceilings - Then picks from the remaining pool

Use it when: - Hitting provider rate limits is your biggest operational problem - You want to stay away from hot deployments before they fail

Required deployment metadata: - model_info.rpm_limit - model_info.tpm_limit

For non-text workloads, rate-limit-aware uses matching per-unit limits when present: - model_info.image_pm_limit - model_info.audio_seconds_pm_limit - model_info.char_pm_limit - model_info.rerank_units_pm_limit

Optional extra safety: - enable router_settings.enable_pre_call_checks

Setup:

router_settings:
  routing_strategy: rate-limit-aware
  enable_pre_call_checks: true

`tag-based-routing`¶

What it does: - Uses the same request-tag eligibility filtering that DeltaLLM already applies before strategy selection - Then uses weighted choice across the remaining tag-matched pool

Important: - DeltaLLM already respects metadata.tags as a general eligibility filter before strategy selection - Choose tag-based-routing when you want the route-group policy itself to communicate that tags are the main routing signal

Use it when: - You route by region, tenant tier, compliance boundary, or capability tag - Prompts or callers attach tags such as ["eu"] or ["vip"]

Required deployment metadata: - model_info.tags

Setup:

model_list:
  - model_name: gpt-4o-mini
    deployment_id: eu
    deltallm_params: {provider: openai, model: openai/gpt-4o-mini}
    model_info: {tags: ["eu"]}
  - model_name: gpt-4o-mini
    deployment_id: us
    deltallm_params: {provider: openai, model: openai/gpt-4o-mini}
    model_info: {tags: ["us"]}

Client request:

{
  "model": "gpt-4o-mini",
  "messages": [{"role": "user", "content": "hello"}],
  "metadata": {"tags": ["eu"]}
}

Which Metadata Matters¶

Use these deployment fields when you need more control:

model_info.weight for weighted
model_info.priority for priority-based-routing
model_info.tags for tag-aware routing
model_info.rpm_limit and model_info.tpm_limit for usage-aware and rate-limit-aware routing
model_info.image_pm_limit for image-generation quota-aware routing
model_info.audio_seconds_pm_limit and model_info.char_pm_limit for audio quota-aware routing
model_info.rerank_units_pm_limit for rerank quota-aware routing
pricing metadata for cost-based-routing

Retries and Cooldowns¶

These settings apply to gateway-level retries:

router_settings:
  num_retries: 2
  retry_after: 1
  timeout: 600
  cooldown_time: 60
  allowed_fails: 0

num_retries: how many extra attempts DeltaLLM makes after the first failure
retry_after: base delay before retrying; backoff increases automatically
timeout: maximum request time before DeltaLLM treats the call as failed
cooldown_time: how long a failing deployment stays out of rotation
allowed_fails: how many failures are allowed before cooldown starts

Tip: verify the effective allowed_fails value in the config your environment actually applies. In practice that usually means your mounted config.yaml, your Helm values.yaml, or the rendered ConfigMap in the cluster.

When you publish a route-group policy, that group can override timeout and retry behavior without changing the global config.

Fallback Chains¶

Use fallback chains when one model group should hand work to another.

deltallm_settings:
  fallbacks:
    - gpt-4o:
        - gpt-4o-mini
  context_window_fallbacks:
    - gpt-4o-mini:
        - gpt-4o
  content_policy_fallbacks:
    - gpt-4o:
        - claude-3-sonnet

fallbacks: used for general failures such as timeouts, rate limits, and provider errors
context_window_fallbacks: used when the input is too large for the first model
content_policy_fallbacks: used when a provider rejects the content for policy reasons

Advanced Routing Controls¶

Use these settings when routing needs a little more control:

router_settings.enable_pre_call_checks filters out deployments already above configured RPM or TPM limits before the provider call
router_settings.model_group_alias lets clients call a friendly alias instead of the real group name

Monitor Routing¶

Use these endpoints during rollout and incident response:

GET /health/deployments for current deployment health
GET /health/fallback-events for recent retry and failover activity
GET /metrics for latency, traffic, cooldown, and failure metrics

The Admin UI Settings and Route Groups pages provide the same controls in a more operator-friendly form.

Routing & Failover¶

Quick Path¶

How Routing Works¶

Pick A Strategy Fast¶

Routing Strategies¶

Strategy Details¶

simple-shuffle¶

weighted¶

priority-based-routing¶

least-busy¶

latency-based-routing¶

cost-based-routing¶

usage-based-routing¶

rate-limit-aware¶

tag-based-routing¶

Which Metadata Matters¶

Retries and Cooldowns¶

Fallback Chains¶

Advanced Routing Controls¶

Monitor Routing¶

Related Pages¶

`simple-shuffle`¶

`weighted`¶

`priority-based-routing`¶

`least-busy`¶

`latency-based-routing`¶

`cost-based-routing`¶

`usage-based-routing`¶

`rate-limit-aware`¶

`tag-based-routing`¶