🛠 LiteLLM Rotating-Fallback Proxy

Local OpenAI-compatible gateway · 25+ verified upstreams · FREE vs PAID modes

Operator guide · Updated 2026-05-25


What this is, and why

A single endpoint at http://localhost:4000/v1 that any OpenAI-SDK client (Roo, Kilo, ChatGPT-style apps, raw curl) can target. LiteLLM rotates through 25+ free-tier and paid LLM upstreams behind one virtual model name. When one provider hits a rate limit or auth failure, the next one in the chain serves the request transparently. The caller never has to know — or care — which upstream produced the answer.

Built on 2026-05-25 to consolidate ~14 per-provider consult tools and remove the “which key do I use today?” cognitive overhead. All keys are read from a gitignored file outside the repo; nothing is hardcoded.

Quick start

Start the proxy

bash tools/start_litellm_proxy.sh --background

Or use the slash command /startvllmp in Claude Code. It is idempotent — refuses to double-launch if already running.

Point your client at it

Setting	Value
Base URL	`http://localhost:4000/v1`
Model	`free-mode` or `paid-mode` (or `hybrid-model` for back-compat)
API key	any non-empty string (e.g. `anything`)

The proxy accepts a placeholder key because real upstream authentication happens per-provider from the keys it loaded at startup. Your client doesn't need to know any of them.

Smoke test

curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer anything" \
  -H "Content-Type: application/json" \
  -d '{"model":"free-mode","messages":[{"role":"user","content":"1+1?"}],"max_tokens":40}'

Response headers tell you which upstream served the call:

x-litellm-model-api-base:    https://api.groq.com/openai/v1
x-litellm-attempted-retries: 0
x-litellm-response-cost:     0.00002
x-litellm-response-duration-ms: 384
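The same smoke test can be sketched in Python with only the standard library. The helper names below (build_request, routing_headers, ask) are illustrative and not part of the repo; ask needs the proxy to be running, while the other two are pure functions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:4000/v1"  # proxy endpoint from the settings table


def build_request(prompt, model="free-mode", max_tokens=40):
    """Build the same POST that the curl smoke test sends."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer anything",  # placeholder key is accepted
            "Content-Type": "application/json",
        },
        method="POST",
    )


def routing_headers(headers):
    """Keep only the x-litellm-* headers, with lower-cased names."""
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower().startswith("x-litellm-")
    }


def ask(prompt, model="free-mode"):
    """Send the request; return (reply text, routing headers). Needs a live proxy."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=60) as resp:
        payload = json.load(resp)
        return payload["choices"][0]["message"]["content"], routing_headers(resp.headers)
```

Calling ask("1+1?") against a running proxy returns the reply together with the headers that identify the upstream, so a script can log which provider answered.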

The two modes

You select a mode by passing one of these as the model field:

Mode	Upstreams	Cost to you	Use for
free-mode	22 upstreams, all $0 or free-tier-bounded	$0 (subject to per-provider free-tier daily/monthly caps)	Routine chat, code generation, anything that doesn't need premium reasoning
paid-mode	8 upstreams, premium frontier models	Real money — billed per-token by the upstream provider (Anthropic, DeepSeek, Moonshot, etc.)	Hard reasoning, large-context strategy work, tasks the free tier can't handle
free-mode-large	5 long-context upstreams (Gemini 1M, OpenRouter Ring 262k, NVIDIA, Fireworks, Hypereal)	$0	Auto-triggered when a free-mode call exceeds the upstream's context window
paid-mode-large	3 long-context paid upstreams	Paid	Auto-triggered for oversize paid-mode prompts
hybrid-model / hybrid-model-large	Backward-compat aliases (subset of free-mode)	$0	Existing Roo/Kilo configs that target this name keep working

How to know which mode is active right now

Run /modevllmp in Claude Code (or python3 tools/vllmp_mode_status.py directly). Each model group shows UPSTREAMS / HEALTHY / COOLED / 60-min-request-count / overall STATUS (HEALTHY · DEGRADED · ALL COOLED).
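One plausible reading of the STATUS column, sketched below, is that it is derived from the upstream and cooled counts. The thresholds are an assumption, not taken from tools/vllmp_mode_status.py, and group_status is a hypothetical name.

```python
def group_status(upstreams, cooled):
    """Summarise one model group from its upstream and cooled counts.

    Assumed rule: no cooled upstreams is HEALTHY, all cooled is ALL COOLED,
    anything in between is DEGRADED.
    """
    if upstreams <= 0:
        raise ValueError("a model group needs at least one upstream")
    if not 0 <= cooled <= upstreams:
        raise ValueError("cooled must be between 0 and upstreams")
    if cooled == 0:
        return "HEALTHY"
    if cooled == upstreams:
        return "ALL COOLED"
    return "DEGRADED"
```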

Automatic fallback chain

Every group has a fallback target so a fully-cooled mode escalates rather than failing:

hybrid-model    → free-mode-large → free-mode → paid-mode
free-mode       → free-mode-large → paid-mode
free-mode-large → paid-mode-large → paid-mode → free-mode
paid-mode       → paid-mode-large → free-mode-large → free-mode
paid-mode-large → free-mode-large → paid-mode → free-mode

In addition, context-window-specific fallbacks handle the ContextWindowExceededError class, so a prompt that overflows a small-context upstream is re-routed to a 128k+ model automatically.
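The chain above can be read as an ordered lookup. The sketch below models only that ordering as plain data; it is not LiteLLM's router code, and first_available and all_cooled are hypothetical names.

```python
# Fallback order copied from the chain listing in this guide.
FALLBACKS = {
    "hybrid-model": ["free-mode-large", "free-mode", "paid-mode"],
    "free-mode": ["free-mode-large", "paid-mode"],
    "free-mode-large": ["paid-mode-large", "paid-mode", "free-mode"],
    "paid-mode": ["paid-mode-large", "free-mode-large", "free-mode"],
    "paid-mode-large": ["free-mode-large", "paid-mode", "free-mode"],
}


def first_available(requested, all_cooled):
    """Return the first group in the chain that is not fully cooled, or None."""
    for group in [requested, *FALLBACKS.get(requested, [])]:
        if group not in all_cooled:
            return group
    return None
```

For example, with free-mode and free-mode-large both fully cooled, a free-mode request lands on paid-mode, which is the point at which the call starts costing money.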

Provider inventory (verified 2026-05-25)

43 keys were tested by tools/verify_all_keys.py (re-runnable any time to re-audit). Of those, 27 are alive and wired into the rotation. The rest are listed with their status so you know what you have versus what's actively serving traffic.

Free-mode pool (active)

Provider	Model	Why it's "free"	Notes
Groq	llama-3.1-8b-instant	Free tier (RPD cap)	Sub-second, tool-call safe
NVIDIA NIM ×2 keys	meta/llama-3.1-8b-instruct	Generous free credits	Primary + alt key rotated
Gemini ×2 keys	gemini-flash-latest	Free tier (20 req/day/key)	1M context — daily cap hits fast under load
GitHub Models ×2 keys	gpt-4o-mini (via Azure OpenAI compat)	Free with GitHub PAT	Per-tier 8k token request cap on some accounts
Fireworks ×2 keys	kimi-k2p5 (free model)	Free model on free tier	Fast, 128k context
DeepInfra (alt key only)	Meta-Llama-3.1-8B-Instruct	Free tier	Primary key dead, alt works
Nous ×2 keys	deepseek/deepseek-v4-flash:free	Per Nous portal: only this :free model is free	Other Nous models are paid
Mistral ×3 keys	mistral-small-latest	Free tier	32k context, tool-call safe
OpenRouter	inclusionai/ring-2.6-1t	:free variant on OpenRouter	1-trillion-param reasoning, 262k context
OFOX	z-ai/glm-4.7-flash:free	:free variant	Chinese gateway, OpenAI-compat
LLM7	gpt-4o-mini-2024-07-18	Free tier	Public free LLM endpoint
Cerebras	llama3.1-8b	Free tier	Sub-second; rejects tool schemas with minItems/maxItems
Together ×2 keys	Meta-Llama-3-8B-Instruct-Lite	Lite variant is free	8k context — fallback handles overflow
Bluesmind	meta/llama-3.1-8b-instruct	Free tier on aggregator	160 models total, most are paid-tier-restricted
Hypereal	gpt-5.5-instant	Was free post-top-up	Will be moved to paid-mode if billing changes

Paid-mode pool (active)

Provider	Model	Roughly costs	Notes
Anthropic ×2 keys	claude-haiku-4-5	Per-token, low	Native Anthropic API via /v1/messages
DeepSeek (native)	deepseek-chat	Per-token, very low	Direct DeepSeek API
Moonshot (native)	moonshot-v1-8k	Per-token	Direct Kimi API
AIMLAPI (paid key)	gpt-4o-mini	$20 cap	Free AIMLAPI key is permanently exhausted
Hypereal ×2 keys	gpt-5.5-instant	Per-token from credit pool	Top up at hypereal.cloud as needed
OpenAI (direct)	gpt-4o-mini	Per-token	Currently out of quota — auto-uses if topped up

Excluded (with reason; re-running the verify script will resurface a provider if its status changes)

Provider	Status	How to fix
xAI Grok	Key reports as invalid per xAI's own response	Regenerate at console.x.ai
Cloudflare Workers AI	Daily 10k-neuron quota exhausted	Auto-resets at UTC midnight; available via `cloudflare-llama` direct alias
HuggingFace (all 3 tokens)	Account-pool monthly credits depleted	Wait for next month, or subscribe to PRO
Ollama Cloud	Key is an SSH-ed25519 public key — needs JWT signing, not bearer auth	Would require a custom signing shim — not implemented
Qwen (DashScope)	Key invalid per provider	Re-issue key in Alibaba console
Chutes	Account balance $0	Top up
Inception	Models locked to accounts created before a cutoff date	Not fixable on this account
AIMLAPI free key	ALL_TIME_LIMIT reached (permanent)	Paid AIMLAPI key works and is in paid-mode
Cursor / Kilocode / Opencode	CLI auth tokens, not HTTPS chat APIs	Use the corresponding `/consult-*` slash commands directly instead

Direct-target single-provider aliases

Bypass rotation by using one of these model names — useful for testing one provider in isolation:

cloudflare-llama — direct CF Workers AI
nvidia-deepseek-v4-pro — direct NVIDIA NIM with DeepSeek
openrouter-ring-1t — direct OpenRouter Ring 2.6 1T
claude-haiku-direct — direct Anthropic Claude Haiku 4.5
deepseek-chat-direct — direct DeepSeek native

Slash commands (Claude Code)

Command	What it does
`/startvllmp` (alias `/STARTVLLMP`)	Start proxy in background. Refuses if already running.
`/restartvllmp` (alias `/RESTARTVLLMP`)	Kill existing + relaunch. Use after editing config or rotating a key.
`/stopvllmp`	Stop the proxy.
`/statusvllmp`	PID, health, key load count, registered model groups, recent log tail.
`/modevllmp`	The mode dashboard. Per-group: UPSTREAMS / HEALTHY / COOLED / 60-min-req-count / STATUS. Tells you at a glance whether free or paid is degraded.
`/testvllmp [N]`	Fire N test requests, report distinct upstreams hit (proves rotation works).
`/tailvllmp [N]`	Tail the proxy log with rotation / cooldown / 429 lines highlighted.

All commands live in .claude/commands/ and are normal markdown — feel free to edit.

Security & key management

Three-rule security model

(1) Keys live outside the repo, in a gitignored file in the user's home directory.
(2) The proxy launcher reads keys at startup and exports them into the process env before launching LiteLLM. Keys never appear in the YAML config — only env-var references like os.environ/PROVIDER_API_KEY.
(3) Key loading is file-first: values from the keys file take precedence, so stale shell env vars from old sessions cannot silently override the canonical file values. This rule was added after old GROQ_API_KEY values in the shell caused 403s on providers whose keys were live.
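Rule (3) amounts to a merge in which the file wins. A minimal sketch of that precedence, assuming both sources are available as dictionaries (effective_env is a hypothetical helper; the real launcher is a shell script):

```python
def effective_env(file_keys, shell_env):
    """Merge the two sources so that values from the keys file always win."""
    merged = dict(shell_env)   # start from whatever the shell already exports
    merged.update(file_keys)   # file values overwrite any stale shell values
    return merged
```

A variable that exists only in the shell (PATH, for instance) passes through unchanged; only names that also appear in the keys file are replaced.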

Where the keys live

Canonical source: ~/dbpasses.txt — gitignored, outside the repo. Owner-readable only.
Label format: human-readable line (e.g. NVIDIA:, GROQ FREE KEY:, OPEN ROUTER API KEY) followed by the key value on the next non-blank line.
The launcher's parser tolerates trailing whitespace on labels, skips sub-header lines (pure-uppercase identifiers like sub-label markers), skips URL annotation lines, and skips section divider lines (=====).
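The label-then-value format can be parsed roughly as follows. This is a Python sketch of the rules listed above, not the launcher's actual parse_key implementation; the regular expression for sub-header lines in particular is an assumed shape.

```python
import re

DIVIDER = re.compile(r"^=+$")                  # section divider such as =====
URL_NOTE = re.compile(r"^https?://")           # URL annotation line
SUB_HEADER = re.compile(r"^[A-Z][A-Z _]*:?$")  # assumed shape of a sub-label marker


def find_key(text, label):
    """Return the value that follows `label`, or None if the label is absent."""
    # strip() also removes trailing whitespace on label lines.
    lines = [line.strip() for line in text.splitlines()]
    for index, line in enumerate(lines):
        if line != label:
            continue
        for candidate in lines[index + 1:]:
            if not candidate:
                continue  # blank line
            if DIVIDER.match(candidate) or URL_NOTE.match(candidate):
                continue
            if SUB_HEADER.match(candidate):
                continue
            return candidate
        return None  # label found, but nothing usable after it
    return None
```

One limitation of this sketch: a label with no value underneath would pick up the next provider's value, because the next label also looks like a sub-header. A production parser needs some way to tell the two apart.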

What the proxy never does

Never hardcodes keys in litellm_config.yaml — only os.environ/... references.
Never echoes keys in logs (LiteLLM sanitizes; the cooldown callback records error messages but those are upstream responses, not keys).
Never commits the keys file (gitignored).
Never meant to serve requests from outside localhost, but this is not enforced by default: LiteLLM binds 0.0.0.0:4000, so the port is reachable from the network unless you run with --host 127.0.0.1 or restrict it at the OS firewall level.

Re-auditing keys

Whenever you add or rotate a key, re-run the audit:

python3 tools/verify_all_keys.py

The harness pings every provider's chat-completions endpoint with “1+1?” in parallel and reports OK / DEAD / NOFUNDS / QUOTA / NOKEY / CONFIG per key. It sends a real browser User-Agent so that Cloudflare-fronted providers (Groq, Together, Cerebras, AIMLAPI, LLM7) don't return their bot-protection 403.
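The parallel probe-and-classify pattern can be sketched as below. The mapping from HTTP status to label is an assumption for illustration, not the logic in tools/verify_all_keys.py, and probe stands for any caller-supplied function that returns an HTTP status code.

```python
from concurrent.futures import ThreadPoolExecutor


def classify(key, http_status):
    """Map one probe result to an audit label (assumed mapping)."""
    if not key:
        return "NOKEY"
    if http_status == 200:
        return "OK"
    if http_status in (401, 403):
        return "DEAD"
    if http_status == 402:
        return "NOFUNDS"
    if http_status == 429:
        return "QUOTA"
    return "CONFIG"  # anything else is treated as a setup problem


def audit(keys, probe, max_workers=8):
    """Probe every provider in parallel; probe(name, key) returns an HTTP status."""
    def check(item):
        name, key = item
        status = probe(name, key) if key else None  # skip the call when no key
        return name, classify(key, status)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(check, keys.items()))
```

Passing the probe in as a function keeps the classification testable without network access.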

Smart cooldown observability

Every upstream failure goes through tools/litellm_smart_cooldown.py, a custom LiteLLM logger that:

Classifies the failure into a category: daily_quota_cf · monthly_quota_hf · dead_key · rate_limit · server_error · bad_request · other
Computes a meaningful unban time per category (CF → next UTC midnight, HF → first of next month, dead keys → 24h, transient → 300s)
Writes the state to /tmp/litellm_cooldown_state.json with cool_until_utc, hit count, last error message, last seen time
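The per-category unban times can be computed as in the sketch below. The category names and durations come from the list above; treating every remaining category as transient is an assumption, and cool_until is a hypothetical function name. This computes only the recorded time and does not enforce anything in the router.

```python
from datetime import datetime, timedelta, timezone


def cool_until(category, now):
    """Return the UTC time at which a failed upstream should recover.

    `now` must be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    if category == "daily_quota_cf":
        # Next UTC midnight.
        tomorrow = now.date() + timedelta(days=1)
        return datetime(tomorrow.year, tomorrow.month, tomorrow.day, tzinfo=timezone.utc)
    if category == "monthly_quota_hf":
        # First day of the next month, 00:00 UTC.
        year, month = (now.year + 1, 1) if now.month == 12 else (now.year, now.month + 1)
        return datetime(year, month, 1, tzinfo=timezone.utc)
    if category == "dead_key":
        return now + timedelta(hours=24)
    # rate_limit, server_error, bad_request, other: assumed transient.
    return now + timedelta(seconds=300)
```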

Inspect anytime:

jq . /tmp/litellm_cooldown_state.json

Honest limitation

LiteLLM's router still enforces its own static 300s cooldown_time internally. The smart callback gives the operator visibility into why something is parked and when it should recover, but enforcing those long unban windows per deployment would require monkey-patching router internals that change across LiteLLM versions. The recorded categories are therefore informational; actual retry timing follows the static 300s.

Troubleshooting

"Connection error" in Roo / Kilo

Usually means the proxy is restarting. Wait 10 seconds and retry, or check /statusvllmp to confirm PID is alive.

413 "Request body too large"

One of the upstreams in the rotation has a small per-request cap (e.g. GitHub Models gpt-4o-mini on certain tiers is 8k tokens). The proxy auto-falls back to a long-context group via the generic fallbacks: rules. If you still see the error, the large group is also exhausted — switch to paid-mode-large manually.

429 "MidStreamFallbackError"

An upstream started streaming and then hit a rate limit mid-stream. LiteLLM can't always recover from this cleanly. Usually it's Gemini hitting its 20 req/day free-tier cap. The next call will rotate to a different upstream. To make this less common, top up Gemini's billing or rely on paid-mode.

"NOKEY" in /modevllmp or verify_all_keys

Means the launcher couldn't find that provider's label in ~/dbpasses.txt. Either the label string doesn't match exactly, the key file has unexpected indentation, or the key block is structured differently than expected (e.g. label-then-sub-label-then-value). Open tools/start_litellm_proxy.sh and check the parse_key line for that provider against the actual file content.

One provider keeps cooling immediately after restart

Run jq . /tmp/litellm_cooldown_state.json and look at the category field:

dead_key → regenerate the key at the provider's console
monthly_quota_hf → wait for first of next month or pay
daily_quota_cf → waits until UTC midnight automatically
rate_limit → transient, should clear on its own
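The lookup above can be scripted. The sketch below assumes the state file is a JSON object keyed by deployment name, with each record holding category and cool_until_utc, and that cool_until_utc is an ISO-8601 string that datetime.fromisoformat accepts. That layout is a guess; check it against the real file before relying on it. triage and load_state are hypothetical names.

```python
import json
from datetime import datetime, timezone

# Advice text taken from the category list in this guide.
ADVICE = {
    "dead_key": "regenerate the key at the provider's console",
    "monthly_quota_hf": "wait for the first of next month, or pay",
    "daily_quota_cf": "clears automatically at UTC midnight",
    "rate_limit": "transient, should clear on its own",
}


def triage(state, now):
    """Return (deployment, category, advice) for entries still cooling at `now`."""
    rows = []
    for deployment, record in sorted(state.items()):
        until = datetime.fromisoformat(record["cool_until_utc"])
        if until.tzinfo is None:
            until = until.replace(tzinfo=timezone.utc)  # field name says UTC
        if until > now:
            category = record.get("category", "other")
            advice = ADVICE.get(category, "inspect the last error message")
            rows.append((deployment, category, advice))
    return rows


def load_state(path="/tmp/litellm_cooldown_state.json"):
    """Read the cooldown state file written by the smart cooldown logger."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A typical call is triage(load_state(), datetime.now(timezone.utc)), which lists only the deployments that are still parked.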

Want to add a new provider

Add the key to ~/dbpasses.txt with a clear label.
Add a parse_key "LABEL FROM FILE" ENV_VAR_NAME line to tools/start_litellm_proxy.sh.
Add an entry to tools/verify_all_keys.py for the audit harness (label, env var, test function).
Run the audit; if it returns OK, add a model_list entry to litellm_config.yaml under the appropriate group (free-mode if zero-cost, paid-mode if billed).
Run /restartvllmp and confirm with /modevllmp.