🛠 LiteLLM Rotating-Fallback Proxy

Local OpenAI-compatible gateway · 25+ verified upstreams · FREE vs PAID modes

Operator guide · Updated 2026-05-25

What this is, and why

A single endpoint at http://localhost:4000/v1 that any OpenAI-SDK client (Roo, Kilo, ChatGPT-style apps, raw curl) can target. LiteLLM rotates through 25+ free-tier and paid LLM upstreams behind one virtual model name. When one provider hits a rate limit or auth failure, the next one in the chain serves the request transparently. The caller never has to know, or care, which upstream produced the answer.

Built on 2026-05-25 to consolidate ~14 per-provider consult tools and remove the “which key do I use today?” cognitive overhead. All keys are read from a gitignored file outside the repo; nothing is hardcoded.

Quick start

Start the proxy

bash tools/start_litellm_proxy.sh --background

Or use the slash command /startvllmp in Claude Code. It is idempotent: it refuses to double-launch if the proxy is already running.

Point your client at it

Setting | Value
Base URL | http://localhost:4000/v1
Model | free-mode or paid-mode (or hybrid-model for back-compat)
API key | any non-empty string (e.g. anything)

The proxy accepts a placeholder key because real upstream authentication happens per-provider from the keys it loaded at startup. Your client doesn't need to know any of them.

Smoke test

curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer anything" \
  -H "Content-Type: application/json" \
  -d '{"model":"free-mode","messages":[{"role":"user","content":"1+1?"}],"max_tokens":40}'

Response headers tell you which upstream served the call:

x-litellm-model-api-base:    https://api.groq.com/openai/v1
x-litellm-attempted-retries: 0
x-litellm-response-cost:     0.00002
x-litellm-response-duration-ms: 384
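
The same smoke test can be sketched in Python with only the standard library. This is a minimal sketch, assuming the proxy is listening on localhost:4000 as described above; the helper names build_smoke_request, litellm_headers and run_smoke_test are illustrative and are not part of the repo's tools.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:4000/v1/chat/completions"


def build_smoke_request(model="free-mode", prompt="1+1?", api_key="anything"):
    """Build (but do not send) the same request as the curl smoke test."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 40,
    }
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def litellm_headers(headers):
    """Keep only the x-litellm-* response headers, with lower-cased names."""
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower().startswith("x-litellm-")
    }


def run_smoke_test():
    """Send the request (needs a running proxy) and return the routing headers."""
    with urllib.request.urlopen(build_smoke_request(), timeout=60) as resp:
        return litellm_headers(resp.headers)
```

Calling run_smoke_test() several times and comparing the x-litellm-model-api-base values is a quick way to see rotation happening.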

The two modes

You select a mode by passing one of these as the model field:

Mode | Upstreams | Cost to you | Use for
free-mode | 22 upstreams, all $0 or free-tier-bounded | $0 (subject to per-provider free-tier daily/monthly caps) | Routine chat, code generation, anything that doesn't need premium reasoning
paid-mode | 8 upstreams, premium frontier models | Real money, billed per-token by the upstream provider (Anthropic, DeepSeek, Moonshot, etc.) | Hard reasoning, large-context strategy work, tasks the free tier can't handle
free-mode-large | 5 long-context upstreams (Gemini 1M, OpenRouter Ring 262k, NVIDIA, Fireworks, Hypereal) | $0 | Auto-triggered when a free-mode call exceeds the upstream's context window
paid-mode-large | 3 long-context paid upstreams | Paid | Auto-triggered for oversize paid-mode prompts
hybrid-model / hybrid-model-large | Backward-compat aliases (subset of free-mode) | $0 | Existing Roo/Kilo configs that target this name keep working

How to know which mode is active right now

Run /modevllmp in Claude Code (or python3 tools/vllmp_mode_status.py directly). Each model group shows UPSTREAMS / HEALTHY / COOLED / 60-min-request-count / overall STATUS (HEALTHY · DEGRADED · ALL COOLED).
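
One plausible way to derive the overall STATUS column from the per-group counts is sketched below. The thresholds are an assumption made for illustration (no cooled upstreams means HEALTHY, all cooled means ALL COOLED, anything in between means DEGRADED); the actual rule in tools/vllmp_mode_status.py may differ, and group_status is a hypothetical name.

```python
def group_status(upstreams: int, cooled: int) -> str:
    """Summarise a model group from its upstream count and cooled count."""
    if upstreams <= 0:
        raise ValueError("a group needs at least one upstream")
    if not 0 <= cooled <= upstreams:
        raise ValueError("cooled must be between 0 and upstreams")
    if cooled == 0:
        return "HEALTHY"      # every upstream can take traffic
    if cooled == upstreams:
        return "ALL COOLED"   # nothing left; the fallback chain takes over
    return "DEGRADED"         # some upstreams parked, some still serving
```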

Automatic fallback chain

Every group has a fallback target so a fully-cooled mode escalates rather than failing:

hybrid-model    → free-mode-large → free-mode → paid-mode
free-mode       → free-mode-large → paid-mode
free-mode-large → paid-mode-large → paid-mode → free-mode
paid-mode       → paid-mode-large → free-mode-large → free-mode
paid-mode-large → free-mode-large → paid-mode → free-mode

In addition, context-window-specific fallbacks handle the ContextWindowExceededError class, so a prompt that overflows a small-context upstream is re-routed to a 128k+ model automatically.
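
The chain above can be modelled as an ordered lookup. This is an illustrative sketch, not LiteLLM's router code: it assumes each listed fallback is tried left to right and that a group is skipped only when all of its upstreams are cooled. FALLBACKS mirrors the table above; resolve_group is a hypothetical helper.

```python
FALLBACKS = {
    "hybrid-model": ["free-mode-large", "free-mode", "paid-mode"],
    "free-mode": ["free-mode-large", "paid-mode"],
    "free-mode-large": ["paid-mode-large", "paid-mode", "free-mode"],
    "paid-mode": ["paid-mode-large", "free-mode-large", "free-mode"],
    "paid-mode-large": ["free-mode-large", "paid-mode", "free-mode"],
}


def resolve_group(requested, fully_cooled):
    """Return the first group in the chain that still has a usable upstream.

    `fully_cooled` is the set of group names whose upstreams are all parked.
    Returns None when the requested group and every fallback are exhausted.
    """
    for group in [requested, *FALLBACKS.get(requested, [])]:
        if group not in fully_cooled:
            return group
    return None
```

For example, resolve_group("free-mode", {"free-mode", "free-mode-large"}) escalates to "paid-mode", which is why a fully cooled free pool can start costing money.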

Provider inventory (verified 2026-05-25)

43 keys were tested by tools/verify_all_keys.py (re-runnable any time to re-audit). Of those, 27 are alive and wired into the rotation. The rest are listed with their status so you know what you have versus what's actively serving traffic.

Free-mode pool (active)

Provider | Model | Why it's "free" | Notes
Groq | llama-3.1-8b-instant | Free tier (RPD cap) | Sub-second, tool-call safe
NVIDIA NIM ×2 keys | meta/llama-3.1-8b-instruct | Generous free credits | Primary + alt key rotated
Gemini ×2 keys | gemini-flash-latest | Free tier (20 req/day/key) | 1M context; daily cap hits fast under load
GitHub Models ×2 keys | gpt-4o-mini (via Azure OpenAI compat) | Free with GitHub PAT | Per-tier 8k token request cap on some accounts
Fireworks ×2 keys | kimi-k2p5 (free model) | Free model on free tier | Fast, 128k context
DeepInfra (alt key only) | Meta-Llama-3.1-8B-Instruct | Free tier | Primary key dead, alt works
Nous ×2 keys | deepseek/deepseek-v4-flash:free | Per Nous portal: only this :free model is free | Other Nous models are paid
Mistral ×3 keys | mistral-small-latest | Free tier | 32k context, tool-call safe
OpenRouter | inclusionai/ring-2.6-1t | :free variant on OpenRouter | 1-trillion-param reasoning, 262k context
OFOX | z-ai/glm-4.7-flash:free | :free variant | Chinese gateway, OpenAI-compat
LLM7 | gpt-4o-mini-2024-07-18 | Free tier | Public free LLM endpoint
Cerebras | llama3.1-8b | Free tier | Sub-second; rejects tool schemas with minItems/maxItems
Together ×2 keys | Meta-Llama-3-8B-Instruct-Lite | Lite variant is free | 8k context; fallback handles overflow
Bluesmind | meta/llama-3.1-8b-instruct | Free tier on aggregator | 160 models total, most are paid-tier-restricted
Hypereal | gpt-5.5-instant | Was free post-top-up | Will be moved to paid-mode if billing changes

Paid-mode pool (active)

Provider | Model | Approximate cost | Notes
Anthropic ×2 keys | claude-haiku-4-5 | Per-token, low | Native Anthropic API via /v1/messages
DeepSeek (native) | deepseek-chat | Per-token, very low | Direct DeepSeek API
Moonshot (native) | moonshot-v1-8k | Per-token | Direct Kimi API
AIMLAPI (paid key) | gpt-4o-mini | $20 cap | Free AIMLAPI key is permanently exhausted
Hypereal ×2 keys | gpt-5.5-instant | Per-token from credit pool | Top up at hypereal.cloud as needed
OpenAI (direct) | gpt-4o-mini | Per-token | Currently out of quota; used automatically once topped up

Excluded (with reason; re-running the verify script will resurface a provider if its status changes)

Provider | Status | How to fix
xAI Grok | Key reports as invalid per xAI's own response | Regenerate at console.x.ai
Cloudflare Workers AI | Daily 10k-neuron quota exhausted | Auto-resets at UTC midnight; available via cloudflare-llama direct alias
HuggingFace (all 3 tokens) | Account-pool monthly credits depleted | Wait for next month, or subscribe to PRO
Ollama Cloud | Key is an SSH-ed25519 public key; needs JWT signing, not bearer auth | Would require a custom signing shim; not implemented
Qwen (DashScope) | Key invalid per provider | Re-issue key in Alibaba console
Chutes | Account balance $0 | Top up
Inception | Models locked to accounts created before a cutoff date | Not fixable on this account
AIMLAPI free key | ALL_TIME_LIMIT reached (permanent) | Paid AIMLAPI key works and is in paid-mode
Cursor / Kilocode / Opencode | CLI auth tokens, not HTTPS chat APIs | Use the corresponding /consult-* slash commands directly instead

Direct-target single-provider aliases

Bypass rotation by using one of these model names, which is useful for testing one provider in isolation:

Slash commands (Claude Code)

Command | What it does
/startvllmp (alias /STARTVLLMP) | Start proxy in background. Refuses if already running.
/restartvllmp (alias /RESTARTVLLMP) | Kill existing + relaunch. Use after editing config or rotating a key.
/stopvllmp | Stop the proxy.
/statusvllmp | PID, health, key load count, registered model groups, recent log tail.
/modevllmp | The mode dashboard. Per-group: UPSTREAMS / HEALTHY / COOLED / 60-min-req-count / STATUS. Tells you at a glance whether free or paid is degraded.
/testvllmp [N] | Fire N test requests, report distinct upstreams hit (proves rotation works).
/tailvllmp [N] | Tail the proxy log with rotation / cooldown / 429 lines highlighted.

All commands live in .claude/commands/ and are normal markdown, so feel free to edit them.

Security & key management

Three-rule security model

(1) Keys live outside the repo, in a gitignored file in the user's home directory.
(2) The proxy launcher reads keys at startup and exports them into the process env before launching LiteLLM. Keys never appear in the YAML config, only env-var references like os.environ/PROVIDER_API_KEY.
(3) Key loading is file-first: stale shell env vars from old sessions cannot silently override the canonical file values. This caught a bug where old GROQ_API_KEY values were causing 403s on live providers.
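
The file-first rule in (3) can be illustrated with a short sketch. The keys-file layout is not specified here, so the single-line "LABEL: value" format, the "Groq" label, and the helper names parse_keys and build_env are all assumptions made for the example; the real launcher is a shell script and its parser may work differently.

```python
def parse_keys(text):
    """Parse hypothetical 'LABEL: value' lines; skip blanks and '#' comments."""
    keys = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        label, value = line.split(":", 1)
        if value.strip():
            keys[label.strip()] = value.strip()
    return keys


def build_env(file_text, label_to_env, inherited_env):
    """File-first merge: values from the keys file replace inherited ones."""
    env = dict(inherited_env)  # copy, so the caller's mapping is untouched
    for label, value in parse_keys(file_text).items():
        env_name = label_to_env.get(label)
        if env_name:
            env[env_name] = value  # the file wins over any stale shell value
    return env


# A stale GROQ_API_KEY left over from an old shell session is overridden.
env = build_env(
    "Groq: new-key\n",
    {"Groq": "GROQ_API_KEY"},
    {"GROQ_API_KEY": "stale-key", "PATH": "/usr/bin"},
)
```

The resulting mapping is what would be passed as the child process environment when launching LiteLLM; unrelated variables such as PATH pass through unchanged.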

Where the keys live

What the proxy never does

Re-auditing keys

Whenever you add or rotate a key, re-run the audit:

python3 tools/verify_all_keys.py

The harness pings every provider's chat-completions endpoint with “1+1?” in parallel and reports OK / DEAD / NOFUNDS / QUOTA / NOKEY / CONFIG per key. It sends a real browser User-Agent so that Cloudflare-fronted providers (Groq, Together, Cerebras, AIMLAPI, LLM7) don't return their bot-protection 403.
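
The shape of such a parallel audit can be sketched as follows. This is not the code in tools/verify_all_keys.py: the mapping from HTTP status to label (401/403 to DEAD, 402 to NOFUNDS, 429 to QUOTA, anything else to CONFIG) is an assumption for illustration, and the probe callable is injected so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def classify(key, http_status):
    """Map a key and the probe's HTTP status to an audit label (assumed rules)."""
    if not key:
        return "NOKEY"
    if http_status == 200:
        return "OK"
    if http_status in (401, 403):
        return "DEAD"
    if http_status == 402:
        return "NOFUNDS"
    if http_status == 429:
        return "QUOTA"
    return "CONFIG"


def audit(keys, probe, max_workers=8):
    """Probe every provider in parallel.

    `keys` maps provider name to key (or None); `probe(name, key)` is a
    caller-supplied function that returns the HTTP status of a test request.
    """
    def check(item):
        name, key = item
        status = probe(name, key) if key else None  # never probe without a key
        return name, classify(key, status)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(check, keys.items()))
```

A real probe would POST the “1+1?” prompt to the provider's chat-completions endpoint with a browser User-Agent header and return the response status.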

Smart cooldown observability

Every upstream failure goes through tools/litellm_smart_cooldown.py, a custom LiteLLM logger that:

  1. Classifies the failure into a category: daily_quota_cf · monthly_quota_hf · dead_key · rate_limit · server_error · bad_request · other
  2. Computes a meaningful unban time per category (CF → next UTC midnight, HF → first of next month, dead keys → 24h, transient → 300s)
  3. Writes the state to /tmp/litellm_cooldown_state.json with cool_until_utc, hit count, last error message, last seen time
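
The unban-time rule in step 2 can be written out as a small function. This is a sketch of the arithmetic only, not the code in tools/litellm_smart_cooldown.py; it assumes every category other than the three named ones counts as transient, and cool_until is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

TRANSIENT_SECONDS = 300


def cool_until(category, now):
    """Return the UTC time at which a failed deployment should be retried.

    `now` must be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    if category == "daily_quota_cf":
        # Cloudflare daily quota: next UTC midnight.
        tomorrow = (now + timedelta(days=1)).date()
        return datetime(tomorrow.year, tomorrow.month, tomorrow.day, tzinfo=timezone.utc)
    if category == "monthly_quota_hf":
        # HuggingFace monthly credits: first day of the next month, 00:00 UTC.
        if now.month == 12:
            return datetime(now.year + 1, 1, 1, tzinfo=timezone.utc)
        return datetime(now.year, now.month + 1, 1, tzinfo=timezone.utc)
    if category == "dead_key":
        return now + timedelta(hours=24)
    # rate_limit, server_error, bad_request, other: treated as transient here.
    return now + timedelta(seconds=TRANSIENT_SECONDS)
```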

Inspect anytime:

jq . /tmp/litellm_cooldown_state.json
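
For scripted checks, the same file can be read from Python. The sketch below assumes a schema the guide does not spell out: a top-level object keyed by deployment name, with each record holding an ISO 8601 cool_until_utc string and a category field. Treat the layout and the helper names as hypothetical and adjust them to the real file.

```python
import json
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse an ISO 8601 timestamp; naive values are treated as UTC."""
    parsed = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def still_cooled(state, now):
    """Return {deployment: category} for entries whose cooldown has not expired."""
    return {
        name: record.get("category", "other")
        for name, record in state.items()
        if parse_utc(record["cool_until_utc"]) > now
    }


def load_state(path="/tmp/litellm_cooldown_state.json"):
    """Load the cooldown state written by the smart-cooldown logger."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A typical call is still_cooled(load_state(), datetime.now(timezone.utc)), which lists only the deployments that are parked right now.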

Honest limitation

LiteLLM's router still enforces its own static 300s cooldown_time internally. The smart callback gives the operator visibility into why something is parked and when it should recover, but dynamic per-deployment enforcement of those long unban windows would require monkey-patching router internals that change across LiteLLM versions. The recorded categories are therefore informational; actual retry timing follows the static 300s.

Troubleshooting

"Connection error" in Roo / Kilo

Usually means the proxy is restarting. Wait 10 seconds and retry, or check /statusvllmp to confirm PID is alive.

413 "Request body too large"

One of the upstreams in the rotation has a small per-request cap (e.g. GitHub Models gpt-4o-mini on certain tiers is 8k tokens). The proxy automatically falls back to a long-context group via the generic fallbacks: rules. If you still see the error, the large group is also exhausted; switch to paid-mode-large manually.

429 "MidStreamFallbackError"

An upstream started streaming and then hit a rate limit mid-stream. LiteLLM can't always recover from this cleanly. Usually it's Gemini hitting its 20 req/day free-tier cap. The next call will rotate to a different upstream. To make this less common, top up Gemini's billing or rely on paid-mode.

"NOKEY" in /modevllmp or verify_all_keys

Means the launcher couldn't find that provider's label in ~/dbpasses.txt. Either the label string doesn't match exactly, the key file has unexpected indentation, or the key block is structured differently than expected (e.g. label-then-sub-label-then-value). Open tools/start_litellm_proxy.sh and check the parse_key line for that provider against the actual file content.

One provider keeps cooling immediately after restart

Check jq . /tmp/litellm_cooldown_state.json for the category field:

Want to add a new provider

  1. Add the key to ~/dbpasses.txt with a clear label.
  2. Add a parse_key "LABEL FROM FILE" ENV_VAR_NAME line to tools/start_litellm_proxy.sh.
  3. Add an entry to tools/verify_all_keys.py for the audit harness (label, env var, test function).
  4. Run the audit; if it returns OK, add a model_list entry to litellm_config.yaml under the appropriate group (free-mode if zero-cost, paid-mode if billed).
  5. /restartvllmp and confirm with /modevllmp.