🏆 AI Model Prediction Tournament

Which AI model is actually the best at predicting markets? We pit every cloud model against live markets: same rules, same data, zero shortcuts. Forward-tested. Hallucination-checked. Ranked by real P&L.

Live forward-testing · Started 2026-05-19 · Methodology v1.1 (post-swarm-review)
📡 Latest pick: … · Models with picks (48h): … · All-time in snapshot: … · Next scheduled run: …
Sources: 48h and latest from ai_tournament_picks_latest.json; all-time from ai_tournament_model_summary.json. Cron 0 12 * * * UTC. Stale >24h → red.
Money-ready (production /audit): 0 asset classes pass charter T2 today (money_ready_verdict.json policy-clean cohort). Tournament leaderboard WR/PF are research-only until n≥100 intrabar-replay-resolved closes; do not size live capital from model tiers here. Canonical class stats: /audit major-goal panel (live fetch).
Data flags:
  • 🟢 FORWARD_TEST: Live market, post-2026-05-19
  • 🟡 BACKTEST_VERIFIED: Historical, independently reproduced
  • 🔴 BACKTEST_DISPUTED: Claimed but not reproducible (±10%)
  • ⚫ HALLUCINATION: Fabricated data confirmed (−1 pick penalty)
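
The freshness rule in the status line above (daily cron at 12:00 UTC, red once the latest pick is older than 24h) can be sketched as follows. This is a minimal sketch, not the dashboard's code: the function names are hypothetical, and treating exactly 24h as still fresh is an assumption.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=24)  # "Stale >24h -> red" from the status line


def is_stale(latest_pick_utc: datetime, now_utc: datetime) -> bool:
    """True when the newest pick is more than 24h old (badge turns red)."""
    return now_utc - latest_pick_utc > STALE_AFTER


def next_scheduled_run(now_utc: datetime) -> datetime:
    """Next firing of cron `0 12 * * *` (daily 12:00 UTC) strictly after now."""
    candidate = now_utc.replace(hour=12, minute=0, second=0, microsecond=0)
    if candidate <= now_utc:
        candidate += timedelta(days=1)
    return candidate


# Example usage with fixed timestamps (hypothetical values).
now = datetime(2026, 6, 4, 13, 0, tzinfo=timezone.utc)
print(is_stale(datetime(2026, 6, 3, 11, 0, tzinfo=timezone.utc), now))
print(next_scheduled_run(now))
```

Both functions take timezone-aware UTC datetimes so that the comparison does not depend on the viewer's local clock.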

Tournament Status

✓ Phase 1A: Infrastructure & methodology
✓ Phase 1B: 39 models submitting picks
⟳ Phase 1C: Live price tracking & resolution (auto)
Phase 2: Local models (future)

Leaderboard: Forward-Test Only (min n=30 to rank; CI-adjusted score)

⚠ DATA QUALITY β€” TWO CLEANUPS LANDED 2026-06-04; RANK STILL BUILDING.
(1) Intrabar OHLC replay live 02:01Z — daily-bar TP/SL replay (SL-first conservative ordering, gap-through fills), Binance → CoinGecko → KuCoin Tier-3 for CRYPTO. Non-CRYPTO 100% replay coverage; CRYPTO ~89% and rising.
(2) Mispriced-entry audit: 4,154 picks were marked MISPRICED_ENTRY after entry_price was found to drift >25% from the market price at submission (causes include corporate actions such as the LODE 1:10 split, futures contract rolls, and stale AI training data). These picks are excluded from WR/PF aggregates via is_resolution_trustworthy. Models like fireworks_qwen dropped from 92.1% WR to a correctly de-ranked BUILDING status (n<30 post-cleanup).
Treat the current Tier-1 badges as UNPROVEN until ~7 days of replay-resolved + drift-checked closes (n≥100 post-fix) accumulate. Honest top WRs post-cleanup are now ~57-71% (was 86-92%) — still above 50% baseline, but the rank ordering may continue to shift as more inflated entries get caught.
Fix chain: PRs #512 + f273b6db57 + 893c660c10 + 4fd7cb4c69 (intrabar) + 71062a7462 (drift-guard) + 5853ca6c3b (audit). Audit reports: fireworks_qwen · DB-wide.
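
The two cleanups above can be illustrated with a short sketch. It is not the production code from the listed commits. `is_mispriced_entry`, `Bar`, and `replay_long` are hypothetical names, only long picks are handled, and filling a gapped target at the open is an assumption about how "gap-through fills" are applied.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

DRIFT_LIMIT = 0.25  # entries more than 25% away from market are flagged


def is_mispriced_entry(entry_price: float, market_price: float) -> bool:
    """Flag an entry whose price drifts more than 25% from the market price."""
    return abs(entry_price - market_price) / market_price > DRIFT_LIMIT


@dataclass(frozen=True)
class Bar:
    open: float
    high: float
    low: float
    close: float


def replay_long(bars: Iterable[Bar], tp: float, sl: float) -> Optional[Tuple[str, float]]:
    """Walk daily bars for a long pick; return (outcome, fill) or None if still open.

    SL-first conservative ordering: the stop is checked before the target, so a
    bar that touches both levels resolves as a loss. A bar that opens beyond a
    level fills at the open (gap-through) rather than at the level itself.
    """
    for bar in bars:
        if bar.open <= sl:   # gapped down through the stop
            return ("SL", bar.open)
        if bar.open >= tp:   # gapped up through the target
            return ("TP", bar.open)
        if bar.low <= sl:    # stop touched intrabar, checked first
            return ("SL", sl)
        if bar.high >= tp:   # target touched intrabar
            return ("TP", tp)
    return None


# A bar that touches both levels resolves as a stop-out under SL-first ordering.
print(replay_long([Bar(100, 112, 94, 105)], tp=110, sl=95))
```

Checking the stop first biases results against the model when a daily bar is ambiguous, which is the conservative direction for a leaderboard.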
# | Model | Provider | WR | WR 95% CI | PF | Score* | n picks | n resolved | Tier | Status
*Score = lower_95%(WR) × lower_95%(PF). This only rewards statistically supported performance. Models need n≥30 resolved picks to earn a rank.
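
A hedged sketch of how such a score could be computed follows. The page does not state which interval methods are used, so the Wilson interval for WR and a percentile bootstrap for PF are assumptions, as are all function names and the bootstrap settings.

```python
import math
import random
from typing import Optional, Sequence

Z95 = 1.959963984540054  # two-sided 95% normal quantile


def wilson_lower(wins: int, n: int, z: float = Z95) -> float:
    """Lower bound of the Wilson score interval for a win rate."""
    if n == 0:
        return 0.0
    p = wins / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def profit_factor(pnls: Sequence[float]) -> float:
    """Gross profit divided by gross loss."""
    gains = sum(p for p in pnls if p > 0)
    losses = -sum(p for p in pnls if p < 0)
    if losses == 0:
        return math.inf if gains > 0 else 0.0
    return gains / losses


def bootstrap_pf_lower(pnls: Sequence[float], n_boot: int = 2000, seed: int = 0) -> float:
    """2.5th percentile of the profit factor over bootstrap resamples."""
    rng = random.Random(seed)
    stats = sorted(
        profit_factor(rng.choices(pnls, k=len(pnls))) for _ in range(n_boot)
    )
    return stats[int(0.025 * n_boot)]


def score(pnls: Sequence[float], min_n: int = 30) -> Optional[float]:
    """lower_95%(WR) x lower_95%(PF), or None below the n>=30 ranking floor."""
    if len(pnls) < min_n:
        return None
    wins = sum(1 for p in pnls if p > 0)
    return wilson_lower(wins, len(pnls)) * bootstrap_pf_lower(pnls)


# Hypothetical book: 60 wins of +2.0 and 40 losses of -1.0.
print(score([2.0] * 60 + [-1.0] * 40))
```

Multiplying the two lower bounds means a model with few resolved picks scores low even when its point estimates look strong, which matches the stated intent of rewarding only statistically supported performance.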
Why two tables, and why a model's WR can differ between them

Model Summary β€” All-Time Performance

Model | Picks | WR | PF | Resolved | W/L | Avg PnL | Last Pick (EST) | Personas | Classes | Drill-down

Model Portfolios β€” Risk-Managed Books

One paper portfolio per (AI-model × risk appetite). Each book runs the model's daily picks through a lifecycle engine (entry/sizing/TP/SL/exit) plus appetite-specific risk caps (gross exposure, per-position weight, drawdown breaker). See docs/DESIGN_AI_MODEL_HEDGE_FUND_PORTFOLIOS_2026-05-29.md.
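
As an illustration of the three caps named above, here is a minimal sketch. It assumes weights are fractions of NAV and that a tripped drawdown breaker blocks all new entries. The `RiskCaps` fields, `apply_caps`, and every numeric value are hypothetical and not taken from the design document.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RiskCaps:
    max_gross_exposure: float   # allowed sum of |weights|, as a fraction of NAV
    max_position_weight: float  # largest |weight| for any single position
    max_drawdown: float         # peak-to-trough loss that trips the breaker


def apply_caps(
    target_weights: Dict[str, float],
    caps: RiskCaps,
    nav: float,
    peak_nav: float,
) -> Dict[str, float]:
    """Return the weights a book may actually hold under its appetite caps."""
    drawdown = 1.0 - nav / peak_nav
    if drawdown >= caps.max_drawdown:
        return {}  # breaker tripped: no new entries (assumed behaviour)

    # Clip each position to the per-position cap, preserving its sign.
    cap = caps.max_position_weight
    clipped = {s: max(-cap, min(cap, w)) for s, w in target_weights.items()}

    # Scale everything down proportionally if gross exposure is still too high.
    gross = sum(abs(w) for w in clipped.values())
    if gross > caps.max_gross_exposure:
        scale = caps.max_gross_exposure / gross
        clipped = {s: w * scale for s, w in clipped.items()}
    return clipped


# Hypothetical appetite: 100% gross, 25% per position, 20% drawdown breaker.
caps = RiskCaps(max_gross_exposure=1.0, max_position_weight=0.25, max_drawdown=0.20)
print(apply_caps({"A": 0.5, "B": 0.5, "C": 0.5, "D": 0.5, "E": 0.5}, caps, 100.0, 100.0))
```

Clipping before scaling keeps any single name from dominating the book even when the gross cap is not binding.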

Portfolio | Provider | NAV | Total Return | CAGR | Sharpe 30d | Sortino 30d | MaxDD | PF | Gross Exp | #Open | Last Mark (EST) | Drill
Model registry
Model | Provider | Picks | Resolved | Status
(populated from ai_tournament_model_summary.json)

Personas: how each AI is asked to think

Every model is assigned 2–5 personas per asset class. A persona is a pre-registered strategy lens (entry/exit rules, R:R targets, market-condition fit). The same model can submit different picks under different personas; that is how we measure whether a model is good at, e.g., mean reversion vs breakout trading.

Canonical source: tools/ai_tournament/persona_registry.py + per-model assignments in config/model_persona_mapping.json. Drill into any model on the table above (click model name) to see its per-persona win-rate breakdown.

Personas active in production (23 unique across 7 models, snapshot 2026-05-25)

Technical / momentum: momentum_scalp · trend_follower · breakout_scanner · momentum_momentum · cross_sectional_momentum · systematic_momentum · volatility_breakout · bayesian_breakout
Mean reversion / value: mean_reversion · value_investor · quality_compound · growth_at_reasonable_price · bankruptcy_recovery
Macro / cycle: cta_trend · macro_hedge · inventory_cycle · purchasing_power_parity · seasonal_pattern
Microstructure / event: grid_trader · supply_demand · gamma_raid · correlation_breaker · microcap_momentum

Originality check: Our personas are strategy archetypes (e.g., momentum_scalp, cta_trend). The FinceptTerminal agent set is built around famous investor names (warren_buffett_agent, benjamin_graham_agent, etc., 11 total). No overlap in naming or design: we describe what the strategy does; they describe who would run it.

📋 Pre-Registered Universe (locked 2026-05-19; models must pick within this list)

All models compete on the same pre-approved symbols. Picks outside the universe are accepted but marked as bonus/unranked. Universe locked to prevent cherry-picking.

⚠ Planned expansion: the universe will be widened in v1.2 to match the symbols actually traded on /audit per asset class (EQUITY → full S&P 500 + active /audit symbols; CRYPTO → top 100 + active /audit picks; ETF → SPY/QQQ/IWM + the 30 ETFs the system trades; etc.). When universes diverge, the AI tournament can't be compared apples-to-apples to system performance. To prevent models from taking shortcuts (always picking the same 5 liquid names), we will enforce the /noshortcuts prompt at submission time and allow multiple parallel calls per model+persona so coverage scales with universe size.

💡 Symbol drill-down: on any model's drill-down page you can click a symbol to jump to its row on /audit, so you can compare what the AI is saying about, e.g., NVDA vs what our system's active/closed picks for NVDA show.

EQUITY: S&P 500 constituents (as of 2024-12-31)
CRYPTO: Top 30 by market cap (CoinMarketCap, weekly snapshot)
FOREX: EURUSD, GBPUSD, USDJPY, AUDUSD, USDCHF
COMMODITY: XAUUSD, XAGUSD, CL=F (crude), NG=F (nat gas), HG=F (copper)
ETF: SPY, QQQ, IWM, EEM, GLD
BOND: US10Y (^TNX), US30Y (^TYX)
PENNY / CHEAP STOCKS: KULR, LODE, CTM, MVST, RGTI, QBTS, IONQ, FFIE, ASTS, GSAT, RKLB, WULF, CLSK, MARA (sub-$5 stocks & micro-caps)
FUTURES: ES=F (S&P e-mini), NQ=F (Nasdaq), YM=F (Dow), RTY=F (Russell), CL=F (crude), NG=F (nat gas), GC=F (gold), SI=F (silver), HG=F (copper)
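
The rule that out-of-universe picks are accepted but unranked could be applied at ingestion time roughly as sketched below. The `RANKED` and `BONUS_UNRANKED` labels, the function name, and the treatment of unknown asset classes are assumptions. EQUITY and CRYPTO are omitted because their membership depends on external snapshots.

```python
from typing import Dict, FrozenSet

# Fixed-list classes copied from the universe above. Accepting both the
# US10Y/US30Y names and the ^TNX/^TYX tickers for BOND is an assumption.
UNIVERSE: Dict[str, FrozenSet[str]] = {
    "FOREX": frozenset({"EURUSD", "GBPUSD", "USDJPY", "AUDUSD", "USDCHF"}),
    "COMMODITY": frozenset({"XAUUSD", "XAGUSD", "CL=F", "NG=F", "HG=F"}),
    "ETF": frozenset({"SPY", "QQQ", "IWM", "EEM", "GLD"}),
    "BOND": frozenset({"US10Y", "^TNX", "US30Y", "^TYX"}),
}


def classify_pick(asset_class: str, symbol: str) -> str:
    """Return 'RANKED' for in-universe picks, 'BONUS_UNRANKED' otherwise."""
    allowed = UNIVERSE.get(asset_class.upper())
    if allowed is not None and symbol.upper() in allowed:
        return "RANKED"
    # Unknown classes and off-list symbols are accepted but not ranked.
    return "BONUS_UNRANKED"


print(classify_pick("ETF", "spy"))
print(classify_pick("ETF", "ARKK"))
```

Keeping the lists in frozen sets mirrors the locked universe: membership is checked, never mutated, at submission time.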

Rules & Methodology

How forward-testing works (success / failure definition)

Full methodology: docs/AI_PREDICTION_TOURNAMENT_METHODOLOGY.md · Swarm review: reports/ai_tournament_methodology_swarm_review_20260519.md

Tier-rating algorithms per model · per asset class

Each model in the tournament has been asked: "If you were building a 1–10 pick rating system per asset class, what features, weights, data feeds, and refusal floor would you use?" Below we document each respondent's algorithm so we can:

Live data: audit_dashboard/data/tier_rating_algorithms.json (one entry per model · per asset class, schema in tmp/tier_rating_algorithm_prompt.md). This is currently a manual capture; auto-population from swarm_runs/tier_rating_* + per-model tournament prompts is on the roadmap (see DAILY_IDEAS.MD 2026-05-25 entry). Sourced answers so far: claude_opus_4_7, deepseek_v4 (2 separate runs), gemini_2_5_pro. Pending: grok3, ring_261T, cursor_agent, cerebras_llama4, qwen3_6_max, mercury_v2, nemotron3_super.

⟳ Loading tier-rating algorithms… (if you see this for >3s, the JSON hasn't been generated yet; see DAILY_IDEAS.MD 2026-05-25 03:00 UTC for the raw responses.)

Consensus features (≥2 model agreement, by asset class)

Awaiting more model responses. Current 2-model overlap (Claude + Deepseek):
  • EQUITY: momentum (z-score peer-relative), quality (ROE / ROIC), earnings revision/surprise
  • ETF: flow z-score, trend/momentum, factor exposure
  • CRYPTO: on-chain activity (funding/whale/active-addresses), volume z-score, MVRV-style mean reversion
  • FOREX: interest-rate differential (carry), real/PPP deviation, momentum
  • COMMODITY: term-structure (roll yield), inventory delta, USD inverse
  • BOND: yield-curve slope, credit/duration-adjusted spread, real yield

Model Picks (Forward-Test, Live Tracking)

⟳ Phase 1B in progress: model picks will appear here as submissions are collected and validated.
Picks are ingested from data/ai_tournament/picks_*.json and updated daily by GitHub Actions.

AI vs Our System β€” Strategy Comparison

At the end of Phase 1's first resolution cycle, this table will show the best AI model's strategy vs our validated system strategy per class.

Asset class | Best AI model | AI strategy name | Our system strategy | AI WR | Our WR* | AI PF | Our PF*
EQUITY | TBD | TBD | PACP shadow scoring | - | 49.2% | - | 0.76
CRYPTO | TBD | TBD | quan_engine scalp | - | 44.6% | - | 1.25
COMMODITY | TBD | TBD | CTA trend-following + COT | - | 46.9% | - | 1.78
FOREX | TBD | TBD | Mutation protocol (sub-floor) | - | 46.4% | - | 0.27
ETF | TBD | TBD | sector rotation + VIX | - | 55.2% | - | 1.24
BOND | TBD | TBD | UST TSMOM (B-003) | - | 55.6% | - | 1.72
*Our system stats come from the live audit dashboard (performance.asset_class_health + money_ready_verdicts), not from tournament picks.
⚠ Conflict of interest disclosure

This tournament is operated by the same system that uses Claude Code (Anthropic) as its primary agent. Claude Opus 4.7 is one of the competing models. We disclose this conflict and note:

Disclaimer: This is NOT financial advice. All trading signals, picks, scores, and analysis are for educational and research purposes only. Past performance does not guarantee future results. Trading cryptocurrencies involves substantial risk of loss. Always do your own research (DYOR) before making any investment decisions.