🏆 AI Model Prediction Tournament

Which AI model is actually the best at predicting markets? We pit every cloud model against live markets — same rules, same data, zero shortcuts. Forward-tested. Hallucination-checked. Ranked by real P&L.

Live forward-testing · Started 2026-05-19 · Methodology v1.1 (post-swarm-review)

Money-ready (production /audit): 0 asset classes pass charter T2 today (money_ready_verdict.json policy-clean cohort). Tournament leaderboard WR and PF are research-only until at least 100 intrabar-replay-resolved closes (n≥100) accumulate; do not size live capital from the model tiers shown here. Canonical class stats: /audit major-goal panel (live fetch).

Data flags:
🟢 FORWARD_TEST: live market, post-2026-05-19
🟡 BACKTEST_VERIFIED: historical, independently reproduced
🔴 BACKTEST_DISPUTED: claimed but not reproducible (±10%)
⚫ HALLUCINATION: fabricated data confirmed · −1 pick penalty

Tournament Status

✓ Phase 1A: Infrastructure & methodology

✓ Phase 1B: 39 models submitting picks

⟳ Phase 1C: Live price tracking & resolution (auto)

Phase 2: Local models (future)

Leaderboard — Forward-Test Only (min n=30 to rank; CI-adjusted score)

⚠ DATA QUALITY — TWO CLEANUPS LANDED 2026-06-04; RANK STILL BUILDING.

(1) Intrabar OHLC replay live 02:01Z — daily-bar TP/SL replay (SL-first conservative ordering, gap-through fills), Binance → CoinGecko → KuCoin Tier-3 for CRYPTO. Non-CRYPTO 100% replay coverage; CRYPTO ~89% and rising.

(2) Mispriced-entry audit: 4,154 picks were marked MISPRICED_ENTRY after their entry_price was found to drift >25% from the market price at submission (causes include corporate actions such as the LODE 1:10 split, futures contract rolls, and stale AI training data). These picks are excluded from WR/PF aggregates via is_resolution_trustworthy. Models such as fireworks_qwen fell from a 92.1% WR to BUILDING status (n<30 post-cleanup), which is the correct de-ranking.
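The drift rule in (2) can be sketched in a few lines. This is a minimal illustration, assuming drift is the relative gap between the submitted entry_price and the market price at submission; the function names and the `market_price` argument are hypothetical, and only the 25% threshold comes from this page.

```python
def entry_drift(entry_price: float, market_price: float) -> float:
    """Relative gap between the submitted entry and the market price (assumed definition)."""
    if market_price <= 0:
        raise ValueError("market_price must be positive")
    return abs(entry_price - market_price) / market_price


def is_mispriced_entry(entry_price: float, market_price: float,
                       max_drift: float = 0.25) -> bool:
    """True when the entry drifts more than max_drift (25% on this page) from market."""
    return entry_drift(entry_price, market_price) > max_drift


# Hypothetical quote that is stale by a factor of ten: flagged.
print(is_mispriced_entry(0.50, 5.00))
# Entry roughly 9% away from market: kept.
print(is_mispriced_entry(100.0, 110.0))
```

A pick flagged this way would then be left out of WR/PF aggregates, which is the role the page assigns to is_resolution_trustworthy.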

Treat the current Tier-1 badges as UNPROVEN until ~7 days of replay-resolved + drift-checked closes (n≥100 post-fix) accumulate. Honest top WRs post-cleanup are now ~57-71% (was 86-92%) — still above 50% baseline, but the rank ordering may continue to shift as more inflated entries get caught.

Fix chain: PRs #512 + f273b6db57 + 893c660c10 + 4fd7cb4c69 (intrabar) + 71062a7462 (drift-guard) + 5853ca6c3b (audit). Audit reports: fireworks_qwen · DB-wide.

#	Model	Provider	WR	WR 95% CI	PF	Score*	n picks	n resolved	Tier	Status

*Score = lower_95%(WR) × lower_95%(PF) — only rewards statistically supported performance. Models need n≥30 resolved picks to earn a rank.
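As a rough illustration of the score, the sketch below multiplies a lower 95% bound on WR by a lower 95% bound on PF. The page does not say how either bound is computed, so both choices here are assumptions: a Wilson interval for WR and a percentile bootstrap for PF. All function names are hypothetical.

```python
import math
import random


def wilson_lower(wins: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a win rate (assumed CI method)."""
    if n == 0:
        return 0.0
    p = wins / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def profit_factor(pnls) -> float:
    """Gross gains divided by gross losses."""
    gains = sum(x for x in pnls if x > 0)
    losses = -sum(x for x in pnls if x < 0)
    if losses == 0:
        return float("inf") if gains > 0 else 0.0
    return gains / losses


def bootstrap_lower_pf(pnls, n_boot: int = 2000, seed: int = 0) -> float:
    """2.5th percentile of resampled profit factors (assumed CI method)."""
    rng = random.Random(seed)
    samples = sorted(
        profit_factor(rng.choices(pnls, k=len(pnls))) for _ in range(n_boot)
    )
    return samples[int(0.025 * n_boot)]


def ci_adjusted_score(picks, min_n: int = 30):
    """picks: list of (outcome, pnl), outcome in {"WIN", "LOSS", "EXPIRED"}."""
    n = len(picks)
    if n < min_n:
        return None  # not enough resolved picks to earn a rank
    wins = sum(1 for outcome, _ in picks if outcome == "WIN")
    pnls = [pnl for _, pnl in picks]
    return wilson_lower(wins, n) * bootstrap_lower_pf(pnls)
```

Because both factors are lower bounds, a model with few resolved picks gets a wide interval and a low score even when its raw WR looks strong.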

Why two tables, and why a model's standing can differ between them

Leaderboard — Forward-Test Only (above): the vetted ranking. Includes only models with ≥30 trustworthy resolved picks, excludes impossible resolutions (resolved-before-submitted, or TP/SL on the wrong side of entry), and ranks by a CI-adjusted score = lower-95% WR × lower-95% PF (small samples are penalized). Answers "which model is statistically best."
Model Summary — All-Time Performance (below): the breadth view. Lists every model with its win-rate over all resolved picks (WIN/LOSS/EXPIRED). It applies the same impossible-resolution filter as the Leaderboard (so both rest on the same clean cohort), but has no minimum sample and no CI shrinkage. Answers "who is submitting, and how much."
Both tables now use the same clean resolution cohort, so a model's WR matches between them. The Leaderboard additionally requires n≥30 and ranks by the CI-shrunk score, so a strong-looking small-sample model in the Summary may be unranked or ranked lower on the Leaderboard. For "best model" decisions, trust the Leaderboard; treat a Model-Summary WR below n=30 as activity/breadth, not a proven edge. Note: "n picks" includes still-OPEN picks (~half the field) that don't count toward WR until they resolve.
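The impossible-resolution filter shared by both tables can be sketched as below. This is an illustration only: the `ResolvedPick` fields and the LONG/SHORT convention are assumptions, not the tournament's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ResolvedPick:
    side: str              # "LONG" or "SHORT" (assumed convention)
    entry: float
    tp: float
    sl: float
    submitted_at: datetime
    resolved_at: datetime


def is_possible_resolution(p: ResolvedPick) -> bool:
    """Reject picks resolved before submission or with TP/SL on the wrong side of entry."""
    if p.resolved_at < p.submitted_at:
        return False
    if p.side == "LONG":
        return p.sl < p.entry < p.tp
    if p.side == "SHORT":
        return p.tp < p.entry < p.sl
    return False


def clean_cohort(picks):
    """Keep only picks whose resolution is physically possible."""
    return [p for p in picks if is_possible_resolution(p)]
```

The Leaderboard would then apply its n≥30 minimum and CI-adjusted score on top of this cohort, while the Model Summary reports it without a minimum sample.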

Model Summary — All-Time Performance

Model	Picks	WR	PF	Resolved	W/L	Avg PnL	Last Pick (EST)	Personas	Classes	Drill-down

Model Portfolios — Risk-Managed Books

One paper portfolio per (AI-model × risk appetite). Each book runs the model's daily picks through a lifecycle engine (entry/sizing/TP/SL/exit) plus appetite-specific risk caps (gross exposure, per-position weight, drawdown breaker). See docs/DESIGN_AI_MODEL_HEDGE_FUND_PORTFOLIOS_2026-05-29.md.
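The three appetite-specific caps can be pictured as a pre-trade check. The sketch below is hypothetical: the `RiskCaps` fields, the `can_open` function, and every numeric limit are illustrative assumptions. The actual engine is described in the design document referenced above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskCaps:
    max_gross_exposure: float   # sum of |position value| / NAV
    max_position_weight: float  # |single position value| / NAV
    max_drawdown: float         # breaker trips at this drawdown from peak NAV


def can_open(nav, peak_nav, open_values, new_value, caps):
    """Return (allowed, reason) for adding a position worth new_value to the book."""
    drawdown = 1 - nav / peak_nav
    if drawdown >= caps.max_drawdown:
        return False, "drawdown breaker"
    if abs(new_value) / nav > caps.max_position_weight:
        return False, "per-position weight"
    gross = (sum(abs(v) for v in open_values) + abs(new_value)) / nav
    if gross > caps.max_gross_exposure:
        return False, "gross exposure"
    return True, "ok"


# Illustrative limits for one appetite; the values are made up for the example.
caps = RiskCaps(max_gross_exposure=1.0, max_position_weight=0.10, max_drawdown=0.15)
print(can_open(100_000, 100_000, [30_000], 5_000, caps))
```

Running the same picks through books with different `RiskCaps` is what lets one model produce several portfolios with different risk profiles.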

Portfolio	Provider	NAV	Total Return	CAGR	Sharpe 30d	Sortino 30d	MaxDD	PF	Gross Exp	#Open	Last Mark (EST)	Drill

Model registry

Model	Provider	Picks	Resolved	Status
Loading from `ai_tournament_model_summary.json`…

Personas — how each AI is asked to think

Every model is assigned 2–5 personas per asset class. A persona is a pre-registered strategy lens (entry/exit rules, R:R targets, market-condition fit). The same model can submit different picks under different personas — that's how we measure whether a model is good at, e.g., mean reversion vs breakout trading.

Canonical source: tools/ai_tournament/persona_registry.py + per-model assignments in config/model_persona_mapping.json. Drill into any model on the table above (click model name) to see its per-persona win-rate breakdown.

Personas active in production (23 unique across 7 models, snapshot 2026-05-25)

Technical / momentum: momentum_scalp · trend_follower · breakout_scanner · momentum_momentum · cross_sectional_momentum · systematic_momentum · volatility_breakout · bayesian_breakout

Mean reversion / value: mean_reversion · value_investor · quality_compound · growth_at_reasonable_price · bankruptcy_recovery

Macro / cycle: cta_trend · macro_hedge · inventory_cycle · purchasing_power_parity · seasonal_pattern

Microstructure / event: grid_trader · supply_demand · gamma_raid · correlation_breaker · microcap_momentum

Originality check: Our personas are strategy archetypes (e.g., momentum_scalp, cta_trend). The FinceptTerminal agent set is built around famous investor names (warren_buffett_agent, benjamin_graham_agent, etc., 11 total). No overlap in naming or design — we describe what the strategy does; they describe who would run it.

📋 Pre-Registered Universe (locked 2026-05-19 — models must pick within this list)

All models compete on the same pre-approved symbols. Picks outside the universe are accepted but marked as bonus/unranked. Universe locked to prevent cherry-picking.

⚠ Planned expansion: universe will be widened in v1.2 to match the symbols actually traded on /audit per asset class (EQUITY → full S&P 500 + active /audit symbols; CRYPTO → top 100 + active /audit picks; ETF → SPY/QQQ/IWM + the 30 ETFs the system trades; etc.). When universes diverge, the AI-tournament can't be compared apples-to-apples to system performance. To prevent models from taking shortcuts (always picking the same liquid 5 names) we will enforce the /noshortcutsprompt at submission time and allow multiple parallel calls per model+persona so coverage scales with universe size.

💡 Symbol drill-down: on any model's drill-down page you can click a symbol to jump to its row on /audit — so you can compare what the AI is saying about, e.g., NVDA vs what our system's active/closed picks for NVDA show.

EQUITY S&P 500 constituents (as of 2024-12-31)

CRYPTO Top 30 by market cap (CoinMarketCap, weekly snapshot)

FOREX EURUSD, GBPUSD, USDJPY, AUDUSD, USDCHF

COMMODITY XAUUSD, XAGUSD, CL=F (crude), NG=F (nat gas), HG=F (copper)

ETF SPY, QQQ, IWM, EEM, GLD

BOND US10Y (^TNX), US30Y (^TYX)

PENNY / CHEAP STOCKS KULR, LODE, CTM, MVST, RGTI, QBTS, IONQ, FFIE, ASTS, GSAT, RKLB, WULF, CLSK, MARA (sub-$5 stocks & micro-caps)

FUTURES ES=F (S&P e-mini), NQ=F (Nasdaq), YM=F (Dow), RTY=F (Russell), CL=F (crude), NG=F (nat gas), GC=F (gold), SI=F (silver), HG=F (copper)
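The rule that out-of-universe picks are accepted but unranked amounts to a set lookup. A minimal sketch, using only the three fully enumerated class lists above; the `UNIVERSE` mapping, the status strings, and the function name are hypothetical.

```python
# Subset of the pre-registered universe, copied from the lists above.
UNIVERSE = {
    "FOREX": {"EURUSD", "GBPUSD", "USDJPY", "AUDUSD", "USDCHF"},
    "ETF": {"SPY", "QQQ", "IWM", "EEM", "GLD"},
    "BOND": {"^TNX", "^TYX"},
}


def ranking_status(asset_class: str, symbol: str) -> str:
    """In-universe picks are ranked; anything else is tracked as bonus/unranked."""
    if symbol in UNIVERSE.get(asset_class, set()):
        return "RANKED"
    return "BONUS_UNRANKED"


print(ranking_status("ETF", "SPY"))   # in the locked ETF list
print(ranking_status("ETF", "ARKK"))  # not in the locked ETF list
```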

Rules & Methodology

How forward-testing works (success / failure definition)

Default holding window: 7 days (1 week) from submitted_at, unless the pick specifies a longer asset-class window (see windows below).
WIN — current price hits TP before SL within the window.
LOSS — current price hits SL before TP within the window.
FLAT (EXPIRED) — neither TP nor SL touched by window close. We mark the pick at the closing price; PnL counts but the pick does not earn a "win" badge.
OPEN picks show a live fwd-test status in the drill-down: ↗ toward TP / ↘ toward SL / → flat so far with current unrealized PnL.
Price source: yfinance (equities/ETFs/FX/futures), Binance + CoinGecko fallback (crypto). Resolver runs every ~20 min during sessions; nightly EOD sweep checks all open picks.
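The WIN / LOSS / EXPIRED definitions, combined with the SL-first ordering and gap-through fills described earlier on this page, can be sketched as a daily-bar replay. This is an illustration for long picks only, with assumptions labeled in the comments: the `Bar` type and `resolve_long` are hypothetical names, and filling both gap cases at the bar's open is an assumed convention.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float


def resolve_long(entry: float, tp: float, sl: float, bars):
    """Replay a long pick over the daily bars inside its holding window.

    Returns (outcome, exit_price). The caller is assumed to pass only the
    bars that fall within the pick's window.
    """
    if not bars:
        return "OPEN", entry
    for bar in bars:
        # Gap-through: the bar opens beyond a level, so fill at the open
        # (assumed convention for both the SL and the TP side).
        if bar.open <= sl:
            return "LOSS", bar.open
        if bar.open >= tp:
            return "WIN", bar.open
        # SL-first: if both levels lie inside the bar's range, a daily bar
        # cannot show which was touched first, so assume the stop was.
        if bar.low <= sl:
            return "LOSS", sl
        if bar.high >= tp:
            return "WIN", tp
    # Neither level touched: mark at the last close in the window.
    return "EXPIRED", bars[-1].close
```

Checking the stop before the target is deliberately pessimistic, so a bar that spans both levels can never inflate the win rate.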

Forward-test only for ranking. Backtests recorded but not used for leaderboard position.
Resolution: TP/SL hit (first one) → win/loss. Expiry at closing price (not mid-price). Slippage: 5–10bps equity, 0.2% crypto.
Windows: EQUITY 30d · CRYPTO 14d · COMMODITY 28d · FOREX 21d · ETF 30d · BOND 60d
Min n=30 resolved picks per model per asset class before ranking.
Score = lower_95%(WR) × lower_95%(PF) — only statistically supported performance counts.
Hallucination check: All backtest claims reproduced against real OHLC. Fabrication = ⚫ + −1 pick penalty.
Max SL/TP ratio: 2.0 (SL ≤ 2×TP). Min RR ≥ 1.5. Enforced at submission.
Model version pinned at tournament start for reproducibility.
No shortcuts prompt applied to every model — see /noshortcutsprompt skill.
Random-guess audit (v1.2, planned): after each submission, the same model is re-asked: "Are these picks based on live market data and news you can cite, or is at least one a random guess? List sources per pick. If you cannot cite a source, mark the pick as speculation." Picks the model self-flags as speculation are still tracked but excluded from the leaderboard score until they resolve.
Confidence tooltip: hover any HIGH/MED/LOW badge on the drill-down to see why the model chose that level (rating, market-supported flag, R:R, timeframe, data source).
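The submission-time level checks in the rules above can be sketched as follows. This assumes TP and SL are measured as distances from entry and that RR means TP distance divided by SL distance; the page states neither definition, and `validate_levels` is a hypothetical name. Under these assumptions, any pick that meets RR ≥ 1.5 also satisfies SL ≤ 2×TP, so the RR floor is the binding check.

```python
def validate_levels(entry: float, tp: float, sl: float,
                    max_sl_tp: float = 2.0, min_rr: float = 1.5):
    """Return a list of rule violations; an empty list means the pick is accepted."""
    tp_dist = abs(tp - entry)  # assumed: distance from entry to take-profit
    sl_dist = abs(entry - sl)  # assumed: distance from entry to stop-loss
    if tp_dist == 0 or sl_dist == 0:
        return ["TP and SL must both differ from entry"]
    errors = []
    if sl_dist / tp_dist > max_sl_tp:
        errors.append(f"SL/TP ratio {sl_dist / tp_dist:.2f} exceeds {max_sl_tp}")
    if tp_dist / sl_dist < min_rr:
        errors.append(f"RR {tp_dist / sl_dist:.2f} is below {min_rr}")
    return errors


print(validate_levels(100.0, 115.0, 90.0))  # RR 1.5: accepted
print(validate_levels(100.0, 110.0, 90.0))  # RR 1.0: rejected
```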

Full methodology: docs/AI_PREDICTION_TOURNAMENT_METHODOLOGY.md · Swarm review: reports/ai_tournament_methodology_swarm_review_20260519.md

Tier-rating algorithms per model · per asset class

Each model in the tournament has been asked: "If you were building a 1–10 pick rating system per asset class, what features, weights, data feeds, and refusal floor would you use?" Below we document each respondent's algorithm so we can:

Look for cross-model consensus on which features actually predict edge per class
Identify gaps where our production scoring on /audit is missing features all the models agree on
Spot model-specific biases (e.g., model X overweights momentum on every class — useful when reading their picks)

Live data: audit_dashboard/data/tier_rating_algorithms.json (one entry per model · per asset class, schema in tmp/tier_rating_algorithm_prompt.md). Currently a manual capture — auto-population from swarm_runs/tier_rating_* + per-model tournament prompts is on the roadmap (see DAILY_IDEAS.MD 2026-05-25 entry). Sourced answers so far: claude_opus_4_7, deepseek_v4 (2 separate runs), gemini_2_5_pro. Pending: grok3, ring_261T, cursor_agent, cerebras_llama4, qwen3_6_max, mercury_v2, nemotron3_super.

⟳ Loading tier-rating algorithms… (if you see this for >3s, the JSON hasn't been generated yet — see DAILY_IDEAS.MD 2026-05-25 03:00 UTC for the raw responses.)

Consensus features (≥2 model agreement, by asset class)

Awaiting more model responses. Current 2-model overlap (Claude + Deepseek):

EQUITY: momentum (z-score peer-relative), quality (ROE / ROIC), earnings revision/surprise
ETF: flow z-score, trend/momentum, factor exposure
CRYPTO: on-chain activity (funding/whale/active-addresses), volume z-score, MVRV-style mean reversion
FOREX: interest-rate differential (carry), real/PPP deviation, momentum
COMMODITY: term-structure (roll yield), inventory delta, USD inverse
BOND: yield-curve slope, credit/duration-adjusted spread, real yield
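Deriving a consensus list like the one above is a counting exercise once each model's features are normalized to shared labels. A minimal sketch, assuming a simple mapping of model name to feature labels for one asset class; the input shape, model names, and feature labels below are placeholders, not the schema of tier_rating_algorithms.json.

```python
from collections import Counter


def consensus_features(features_by_model, min_models: int = 2):
    """Features named by at least min_models distinct models, sorted alphabetically."""
    counts = Counter()
    for features in features_by_model.values():
        counts.update(set(features))  # set(): count each model at most once per feature
    return sorted(f for f, c in counts.items() if c >= min_models)


# Placeholder responses for one asset class.
forex_responses = {
    "model_a": ["carry", "momentum", "ppp_deviation"],
    "model_b": ["carry", "momentum", "positioning"],
}
print(consensus_features(forex_responses))
```

The hard part in practice is the normalization step, since two models may describe the same feature in different words; this sketch assumes that mapping has already been done.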

Model Picks (Forward-Test, Live Tracking)

⟳ Phase 1B in progress — model picks will appear here as submissions are collected and validated.
Picks are ingested from data/ai_tournament/picks_*.json and updated daily by GitHub Actions.

AI vs Our System — Strategy Comparison

At the end of Phase 1's first resolution cycle, this table will show the best AI model's strategy vs our validated system strategy per class.

Asset class	Best AI model	AI strategy name	Our system strategy	AI WR	Our WR*	AI PF	Our PF*
EQUITY	TBD	TBD	PACP shadow scoring	—	49.2%	—	0.76
CRYPTO	TBD	TBD	quan_engine scalp	—	44.6%	—	1.25
COMMODITY	TBD	TBD	CTA trend-following + COT	—	46.9%	—	1.78
FOREX	TBD	TBD	Mutation protocol (sub-floor)	—	46.4%	—	0.27
ETF	TBD	TBD	sector rotation + VIX	—	55.2%	—	1.24
BOND	TBD	TBD	UST TSMOM (B-003)	—	55.6%	—	1.72

*Our system stats from the live audit dashboard (performance.asset_class_health + money_ready_verdicts) — not tournament picks.

⚠ Conflict of interest disclosure

This tournament is operated by the same system that uses Claude Code (Anthropic) as its primary agent. Claude Opus 4.7 is one of the competing models. We disclose this conflict and note:

Claude's picks are evaluated by the same automated price-tracking system as all other models.
No human review of individual picks occurs between submission and resolution.
All pick data is stored in the public GitHub repository for independent verification.
The "no shortcuts" prompt is applied identically to every model including Claude.

Disclaimer: This is NOT financial advice. All trading signals, picks, scores, and analysis are for educational and research purposes only. Past performance does not guarantee future results. Trading cryptocurrencies involves substantial risk of loss. Always do your own research (DYOR) before making any investment decisions.