Local OpenAI-compatible gateway · 25+ verified upstreams · FREE vs PAID modes
Operator guide · Updated 2026-05-25
A single endpoint at http://localhost:4000/v1 that any OpenAI-SDK client (Roo, Kilo, ChatGPT-style apps, raw curl) can target. LiteLLM rotates through 25+ free-tier and paid LLM upstreams behind one virtual model name. When one provider hits a rate limit or auth failure, the next one in the chain serves the request transparently. The caller never has to know, or care, which upstream produced the answer.
Built on 2026-05-25 to consolidate ~14 per-provider consult tools and remove the “which key do I use today?” cognitive overhead. All keys are read from a gitignored file outside the repo; nothing is hardcoded.
bash tools/start_litellm_proxy.sh --background
Or use the slash command /startvllmp in Claude Code. It is idempotent: it refuses to double-launch if the proxy is already running.
| Setting | Value |
|---|---|
| Base URL | http://localhost:4000/v1 |
| Model | free-mode or paid-mode (or hybrid-model for back-compat) |
| API key | any non-empty string (e.g. anything) |
The proxy accepts a placeholder key because real upstream authentication happens per-provider from the keys it loaded at startup. Your client doesn't need to know any of them.
curl -s http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer anything" \
-H "Content-Type: application/json" \
-d '{"model":"free-mode","messages":[{"role":"user","content":"1+1?"}],"max_tokens":40}'
Response headers tell you which upstream served the call:
x-litellm-model-api-base: https://api.groq.com/openai/v1
x-litellm-attempted-retries: 0
x-litellm-response-cost: 0.00002
x-litellm-response-duration-ms: 384
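The same call can be made from Python with only the standard library. The sketch below is illustrative rather than part of the shipped tooling: the helper names (`build_chat_request`, `upstream_info`, `ask`) are hypothetical, and it assumes the proxy is reachable at the base URL from the settings table and returns the `x-litellm-*` headers shown above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:4000/v1"  # base URL from the settings table


def build_chat_request(prompt, model="free-mode", max_tokens=40):
    """Build (but do not send) a chat-completions request for the proxy."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=body,
        headers={
            # Placeholder key: the proxy accepts any non-empty string.
            "Authorization": "Bearer anything",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def upstream_info(headers):
    """Pick the x-litellm-* routing headers out of a response header mapping."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return {
        "api_base": lowered.get("x-litellm-model-api-base"),
        "retries": int(lowered.get("x-litellm-attempted-retries", 0)),
        "duration_ms": float(lowered.get("x-litellm-response-duration-ms", 0)),
    }


def ask(prompt, model="free-mode"):
    """Send one request and return (answer_text, upstream_info).

    Requires the proxy to be running.
    """
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=60) as resp:
        payload = json.load(resp)
        info = upstream_info(dict(resp.headers.items()))
    return payload["choices"][0]["message"]["content"], info
```

Calling `ask("1+1?")` several times and collecting `info["api_base"]` gives a quick view of which upstreams the rotation is actually using.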
You select a mode by passing one of these as the model field:
| Mode | Upstreams | Cost to you | Use for |
|---|---|---|---|
| free-mode | 22 upstreams, all $0 or free-tier-bounded | $0 (subject to per-provider free-tier daily/monthly caps) | Routine chat, code generation, anything that doesn't need premium reasoning |
| paid-mode | 8 upstreams, premium frontier models | Real money, billed per-token by the upstream provider (Anthropic, DeepSeek, Moonshot, etc.) | Hard reasoning, large-context strategy work, tasks the free tier can't handle |
| free-mode-large | 5 long-context upstreams (Gemini 1M, OpenRouter Ring 262k, NVIDIA, Fireworks, Hypereal) | $0 | Auto-triggered when a free-mode call exceeds the upstream's context window |
| paid-mode-large | 3 long-context paid upstreams | Paid | Auto-triggered for oversize paid-mode prompts |
| hybrid-model / hybrid-model-large | Backward-compat aliases (subset of free-mode) | $0 | Existing Roo/Kilo configs that target this name keep working |
Run /modevllmp in Claude Code (or python3 tools/vllmp_mode_status.py directly). Each model group shows UPSTREAMS / HEALTHY / COOLED / 60-min-request-count / overall STATUS (HEALTHY · DEGRADED · ALL COOLED).
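One plausible way to derive the overall STATUS column from the per-group counts is sketched below. The thresholds are an assumption for illustration (any cooled upstream means DEGRADED, none healthy means ALL COOLED); the actual rules in tools/vllmp_mode_status.py may differ, and `group_status` is a hypothetical name.

```python
def group_status(upstreams, cooled):
    """Derive a status label from the upstream count and the cooled count.

    Assumed rule: no healthy upstream left -> ALL COOLED,
    some cooled -> DEGRADED, otherwise HEALTHY.
    """
    healthy = upstreams - cooled
    if healthy <= 0:
        return "ALL COOLED"
    if cooled > 0:
        return "DEGRADED"
    return "HEALTHY"
```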
Every group has a fallback target so a fully-cooled mode escalates rather than failing:
hybrid-model → free-mode-large → free-mode → paid-mode
free-mode → free-mode-large → paid-mode
free-mode-large → paid-mode-large → paid-mode → free-mode
paid-mode → paid-mode-large → free-mode-large → free-mode
paid-mode-large → free-mode-large → paid-mode → free-mode
There are also context-window-specific fallbacks for the ContextWindowExceededError class, so a prompt that overflows a small-context upstream is re-routed to a 128k+ model automatically.
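The escalation chains above can be modelled as a plain mapping. This is a sketch of the routing idea, not the proxy's own code: `resolve_group` is a hypothetical helper that walks a chain until it finds a group that is not fully cooled, and `as_litellm_fallbacks` reshapes the mapping into the list-of-single-key-dicts form that LiteLLM's `fallbacks` setting uses. Whether the real config is laid out exactly this way is an assumption.

```python
# Chains copied from the fallback list in this guide.
FALLBACKS = {
    "hybrid-model": ["free-mode-large", "free-mode", "paid-mode"],
    "free-mode": ["free-mode-large", "paid-mode"],
    "free-mode-large": ["paid-mode-large", "paid-mode", "free-mode"],
    "paid-mode": ["paid-mode-large", "free-mode-large", "free-mode"],
    "paid-mode-large": ["free-mode-large", "paid-mode", "free-mode"],
}


def resolve_group(requested, fully_cooled):
    """Return the first group in the chain that still has a healthy upstream.

    `fully_cooled` is the set of group names with no healthy upstream.
    Returns None if the requested group and all its fallbacks are cooled.
    """
    for group in [requested] + FALLBACKS.get(requested, []):
        if group not in fully_cooled:
            return group
    return None


def as_litellm_fallbacks(chains):
    """Reshape {group: [targets]} into a list of single-key dicts."""
    return [{group: targets} for group, targets in chains.items()]
```

For example, with free-mode and free-mode-large both fully cooled, a free-mode request resolves to paid-mode, which is why a free-mode call can still incur cost under sustained load.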
43 keys were tested by tools/verify_all_keys.py (re-runnable any time to re-audit). Of those, 27 are alive and wired into the rotation. The rest are listed with their status so you know what you have versus what's actively serving traffic.
| Provider | Model | Why it's "free" | Notes |
|---|---|---|---|
| Groq | llama-3.1-8b-instant | Free tier (RPD cap) | Sub-second, tool-call safe |
| NVIDIA NIM ×2 keys | meta/llama-3.1-8b-instruct | Generous free credits | Primary + alt key rotated |
| Gemini ×2 keys | gemini-flash-latest | Free tier (20 req/day/key) | 1M context; daily cap hits fast under load |
| GitHub Models ×2 keys | gpt-4o-mini (via Azure OpenAI compat) | Free with GitHub PAT | Per-tier 8k token request cap on some accounts |
| Fireworks ×2 keys | kimi-k2p5 (free model) | Free model on free tier | Fast, 128k context |
| DeepInfra (alt key only) | Meta-Llama-3.1-8B-Instruct | Free tier | Primary key dead, alt works |
| Nous ×2 keys | deepseek/deepseek-v4-flash:free | Per Nous portal: only this :free model is free | Other Nous models are paid |
| Mistral ×3 keys | mistral-small-latest | Free tier | 32k context, tool-call safe |
| OpenRouter | inclusionai/ring-2.6-1t | :free variant on OpenRouter | 1-trillion-param reasoning, 262k context |
| OFOX | z-ai/glm-4.7-flash:free | :free variant | Chinese gateway, OpenAI-compat |
| LLM7 | gpt-4o-mini-2024-07-18 | Free tier | Public free LLM endpoint |
| Cerebras | llama3.1-8b | Free tier | Sub-second; rejects tool schemas with minItems/maxItems |
| Together ×2 keys | Meta-Llama-3-8B-Instruct-Lite | Lite variant is free | 8k context; fallback handles overflow |
| Bluesmind | meta/llama-3.1-8b-instruct | Free tier on aggregator | 160 models total, most are paid-tier-restricted |
| Hypereal | gpt-5.5-instant | Was free post-top-up | Will be moved to paid-mode if billing changes |
| Provider | Model | Roughly costs | Notes |
|---|---|---|---|
| Anthropic ×2 keys | claude-haiku-4-5 | Per-token, low | Native Anthropic API via /v1/messages |
| DeepSeek (native) | deepseek-chat | Per-token, very low | Direct DeepSeek API |
| Moonshot (native) | moonshot-v1-8k | Per-token | Direct Kimi API |
| AIMLAPI (paid key) | gpt-4o-mini | $20 cap | Free AIMLAPI key is permanently exhausted |
| Hypereal ×2 keys | gpt-5.5-instant | Per-token from credit pool | Top up at hypereal.cloud as needed |
| OpenAI (direct) | gpt-4o-mini | Per-token | Currently out of quota; used automatically once topped up |
| Provider | Status | How to fix |
|---|---|---|
| xAI Grok | Key reports as invalid per xAI's own response | Regenerate at console.x.ai |
| Cloudflare Workers AI | Daily 10k-neuron quota exhausted | Auto-resets at UTC midnight; available via cloudflare-llama direct alias |
| HuggingFace (all 3 tokens) | Account-pool monthly credits depleted | Wait for next month, or subscribe to PRO |
| Ollama Cloud | Key is an SSH-ed25519 public key; needs JWT signing, not bearer auth | Would require a custom signing shim (not implemented) |
| Qwen (DashScope) | Key invalid per provider | Re-issue key in Alibaba console |
| Chutes | Account balance $0 | Top up |
| Inception | Models locked to accounts created before a cutoff date | Not fixable on this account |
| AIMLAPI free key | ALL_TIME_LIMIT reached (permanent) | Paid AIMLAPI key works and is in paid-mode |
| Cursor / Kilocode / Opencode | CLI auth tokens, not HTTPS chat APIs | Use the corresponding /consult-* slash commands directly instead |
Bypass rotation by using one of these model names, which is useful for testing one provider in isolation:
- cloudflare-llama → direct CF Workers AI
- nvidia-deepseek-v4-pro → direct NVIDIA NIM with DeepSeek
- openrouter-ring-1t → direct OpenRouter Ring 2.6 1T
- claude-haiku-direct → direct Anthropic Claude Haiku 4.5
- deepseek-chat-direct → direct DeepSeek native
| Command | What it does |
|---|---|
| /startvllmp (alias /STARTVLLMP) | Start proxy in background. Refuses if already running. |
| /restartvllmp (alias /RESTARTVLLMP) | Kill existing + relaunch. Use after editing config or rotating a key. |
| /stopvllmp | Stop the proxy. |
| /statusvllmp | PID, health, key load count, registered model groups, recent log tail. |
| /modevllmp | The mode dashboard. Per-group: UPSTREAMS / HEALTHY / COOLED / 60-min-req-count / STATUS. Tells you at a glance whether free or paid is degraded. |
| /testvllmp [N] | Fire N test requests, report distinct upstreams hit (proves rotation works). |
| /tailvllmp [N] | Tail the proxy log with rotation / cooldown / 429 lines highlighted. |
All commands live in .claude/commands/ and are normal markdown; feel free to edit.
(1) Keys live outside the repo, in a gitignored file in the user's home directory.
(2) The proxy launcher reads keys at startup and exports them into the process env before launching LiteLLM. Keys never appear in the YAML config, only env-var references like os.environ/PROVIDER_API_KEY.
(3) Key loading is file-first: stale shell env vars from old sessions cannot silently override the canonical file values. This caught a bug where old GROQ_API_KEY values were causing 403s on live providers.
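The file-first loading described in (3) can be sketched in a few lines. This is an illustration, not the launcher's actual parser: it assumes each key sits under a label line with the value on the next non-blank line, the function names are hypothetical, and so is the `NVIDIA_API_KEY` variable name (only `GROQ_API_KEY` is named in this guide). Divider lines and nested sub-labels are not handled.

```python
def parse_keys(text, label_to_env):
    """Map env-var names to the key values found under known labels.

    Assumed layout: a label line (with or without a trailing colon),
    then the key value on the next non-blank line.
    """
    lines = [line.strip() for line in text.splitlines()]
    found = {}
    for i, line in enumerate(lines):
        env_name = label_to_env.get(line.rstrip(":"))
        if env_name is None:
            continue
        # Take the first non-blank line after the label as the key value.
        for candidate in lines[i + 1:]:
            if candidate:
                found[env_name] = candidate
                break
    return found


def build_proxy_env(shell_env, file_keys):
    """File-first merge: values from the keys file overwrite shell values."""
    env = dict(shell_env)
    env.update(file_keys)
    return env
```

Because the file values are applied last, a stale `GROQ_API_KEY` exported in an old shell session is replaced by the value from the keys file rather than shadowing it.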
- ~/dbpasses.txt: gitignored, outside the repo. Owner-readable only.
- Each key sits under a label line (e.g. NVIDIA:, GROQ FREE KEY:, OPEN ROUTER API KEY) followed by the key value on the next non-blank line.
- [...] =====).
- No key values in litellm_config.yaml: only os.environ/... references.
- The proxy binds to 0.0.0.0:4000 by default; if you need network isolation, run with --host 127.0.0.1 or restrict at the OS firewall level.
Whenever you add or rotate a key, re-run the audit:
python3 tools/verify_all_keys.py
The harness pings every provider's chat-completions endpoint with “1+1?” in parallel and reports OK / DEAD / NOFUNDS / QUOTA / NOKEY / CONFIG per key. It sends a real browser User-Agent so that Cloudflare-fronted providers (Groq, Together, Cerebras, AIMLAPI, LLM7) don't return their bot-protection 403.
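To make the six result labels concrete, here is one way a probe result could be mapped onto them. The rules below are assumptions chosen for illustration (common HTTP status conventions plus a few body keywords), not the logic in tools/verify_all_keys.py, and `classify_probe` is a hypothetical name.

```python
def classify_probe(key_value, status, body=""):
    """Map one probe result to OK / DEAD / NOFUNDS / QUOTA / NOKEY / CONFIG.

    Assumed rules, for illustration only:
    - no key loaded                         -> NOKEY
    - HTTP 200                              -> OK
    - 402 or a balance-related message      -> NOFUNDS
    - 429 or a quota-related message        -> QUOTA
    - 401 / 403                             -> DEAD
    - anything else (wrong model, bad URL)  -> CONFIG
    """
    if not key_value:
        return "NOKEY"
    text = body.lower()
    if status == 200:
        return "OK"
    if status == 402 or "insufficient" in text or "balance" in text:
        return "NOFUNDS"
    if status == 429 or "quota" in text:
        return "QUOTA"
    if status in (401, 403):
        return "DEAD"
    return "CONFIG"
```

Checking the quota and balance messages before the 401/403 branch matters because some providers report an exhausted account with an auth-style status code.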
Every upstream failure goes through tools/litellm_smart_cooldown.py, a custom LiteLLM logger that:
- Classifies each failure into a category: daily_quota_cf · monthly_quota_hf · dead_key · rate_limit · server_error · bad_request · other
- Records it in /tmp/litellm_cooldown_state.json with cool_until_utc, hit count, last error message, last seen time
Inspect anytime:
jq . /tmp/litellm_cooldown_state.json
LiteLLM's router still enforces its own static 300s cooldown_time internally. The smart callback gives the operator visibility into why something is parked and when it should recover, but dynamic per-deployment enforcement of those long unban windows would require monkey-patching router internals that change across LiteLLM versions. So the recorded categories are informational; actual retry timing follows the static 300s.
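The state file can also be read from Python to list what is still parked. The sketch below assumes a layout that this guide does not spell out: a top-level JSON object keyed by deployment name, with `cool_until_utc` stored as an ISO 8601 timestamp. Adjust the field access if the real file differs; `load_state` and `still_cooling` are hypothetical helpers.

```python
import json
from datetime import datetime, timezone

STATE_PATH = "/tmp/litellm_cooldown_state.json"


def load_state(path=STATE_PATH):
    """Read the cooldown state file written by the smart-cooldown logger."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def still_cooling(state, now=None):
    """Return {deployment: seconds_remaining} for entries not yet recovered.

    Assumes state looks like {"name": {"cool_until_utc": "<ISO 8601>", ...}}.
    """
    now = now or datetime.now(timezone.utc)
    remaining = {}
    for deployment, entry in state.items():
        # Accept a trailing "Z" on Python versions whose parser rejects it.
        stamp = entry["cool_until_utc"].replace("Z", "+00:00")
        until = datetime.fromisoformat(stamp)
        if until.tzinfo is None:
            until = until.replace(tzinfo=timezone.utc)  # treat naive as UTC
        seconds = (until - now).total_seconds()
        if seconds > 0:
            remaining[deployment] = int(seconds)
    return remaining
```

Since the router retries on its static 300s schedule regardless, the remaining time reported here is a diagnostic for the operator, not a prediction of when LiteLLM will next try the deployment.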
Usually means the proxy is restarting. Wait 10 seconds and retry, or check /statusvllmp to confirm PID is alive.
One of the upstreams in the rotation has a small per-request cap (e.g. GitHub Models gpt-4o-mini on certain tiers is 8k tokens). The proxy auto-falls back to a long-context group via the generic fallbacks: rules. If you still see the error, the large group is also exhausted; switch to paid-mode-large manually.
An upstream started streaming and then hit a rate limit mid-stream. LiteLLM can't always recover from this cleanly. Usually it's Gemini hitting its 20 req/day free-tier cap. The next call will rotate to a different upstream. To make this less common, top up Gemini's billing or rely on paid-mode.
Means the launcher couldn't find that provider's label in ~/dbpasses.txt. Either the label string doesn't match exactly, the key file has unexpected indentation, or the key block is structured differently than expected (e.g. label-then-sub-label-then-value). Open tools/start_litellm_proxy.sh and check the parse_key line for that provider against the actual file content.
Check jq . /tmp/litellm_cooldown_state.json for the category field:
- dead_key → regenerate the key at the provider's console
- monthly_quota_hf → wait for the first of next month or pay
- daily_quota_cf → waits until UTC midnight automatically
- rate_limit → transient, should clear on its own
To add a new provider:
(1) Add the key to ~/dbpasses.txt with a clear label.
(2) Add a parse_key "LABEL FROM FILE" ENV_VAR_NAME line to tools/start_litellm_proxy.sh.
(3) Add an entry to tools/verify_all_keys.py for the audit harness (label, env var, test function).
(4) Add a model_list entry to litellm_config.yaml under the appropriate group (free-mode if zero-cost, paid-mode if billed).
(5) Run /restartvllmp and confirm with /modevllmp.
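The shape of the model_list entry mentioned above can be sketched as a Python dict that mirrors the YAML. The `model_name` / `litellm_params` layout and the `os.environ/...` reference form follow LiteLLM's config format; the provider, model string, API base, and env-var name in the usage note are placeholders, and both helper names are hypothetical.

```python
def model_list_entry(group, upstream_model, env_var, api_base=None):
    """Build one model_list entry that references its key by env-var name."""
    params = {
        "model": upstream_model,
        # Reference only; the real key is exported by the launcher at startup.
        "api_key": f"os.environ/{env_var}",
    }
    if api_base:
        params["api_base"] = api_base
    return {"model_name": group, "litellm_params": params}


def has_inline_key(entry):
    """True if the entry carries a literal key instead of an env reference."""
    api_key = str(entry["litellm_params"].get("api_key", ""))
    return not api_key.startswith("os.environ/")
```

For instance, `model_list_entry("free-mode", "openai/example-model", "EXAMPLE_API_KEY", "https://api.example.com/v1")` yields an entry that joins the free-mode rotation, and running `has_inline_key` over every entry is a cheap guard against a key being pasted into the config by mistake.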