Which AI model is actually the best at predicting markets? We pit every cloud model against live markets: same rules, same data, zero shortcuts. Forward-tested. Hallucination-checked. Ranked by real P&L.
Live forward-testing · Started 2026-05-19 · Methodology v1.1 (post-swarm-review)
Status fields (loaded live): Latest pick · Models with picks (48h) · All-time in snapshot · Next scheduled run. The 48h and latest values come from ai_tournament_picks_latest.json; all-time values come from ai_tournament_model_summary.json. Schedule: cron 0 12 * * * UTC. Data older than 24h is shown in red.
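The freshness rule above (one run per day at 12:00 UTC, red when the latest pick is older than 24h) could be checked roughly as follows. This is a minimal sketch: the function names `next_scheduled_run` and `is_stale` are hypothetical and are not taken from the dashboard code, and both expect timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=24)  # "stale >24h" threshold from the status bar


def next_scheduled_run(now: datetime) -> datetime:
    """Next firing of cron `0 12 * * *` (12:00 UTC daily) strictly after `now`."""
    now = now.astimezone(timezone.utc)
    run = now.replace(hour=12, minute=0, second=0, microsecond=0)
    return run if run > now else run + timedelta(days=1)


def is_stale(last_pick_at: datetime, now: datetime) -> bool:
    """True when the latest pick is more than 24h old (the state shown in red)."""
    return now - last_pick_at > STALE_AFTER
```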
Money-ready (production /audit): 0 asset classes pass charter T2 today
(money_ready_verdict.json policy-clean cohort). Tournament leaderboard WR/PF are research-only
until n≥100 intrabar-replay-resolved closes; do not size live capital from model tiers here.
Canonical class stats: /audit major-goal panel (live fetch).
Data flags:
🟢 FORWARD_TEST: Live market, post-2026-05-19
🟡 BACKTEST_VERIFIED: Historical, independently reproduced
🔴 BACKTEST_DISPUTED: Claimed but not reproducible (±10%)
⚫ HALLUCINATION: Fabricated data confirmed · −1 pick penalty
(2) Mispriced-entry audit: 4,154 picks were marked MISPRICED_ENTRY after entry_price was found to drift >25% from the market price at submission (causes include corporate actions such as the LODE 1:10 split, futures contract rolls, and stale AI training data). These picks are excluded from WR/PF aggregates via is_resolution_trustworthy. Models like fireworks_qwen dropped from 92.1% to a correctly de-ranked BUILDING status (n<30 post-cleanup).
Treat the current Tier-1 badges as UNPROVEN until ~7 days of replay-resolved, drift-checked closes (n≥100 post-fix) accumulate. Honest top WRs post-cleanup are now ~57-71% (down from 86-92%). That is still above the 50% baseline, but the rank ordering may continue to shift as more inflated entries get caught.
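The drift rule behind the mispriced-entry audit can be sketched as below. This is an illustration under assumptions, not the audit's actual code: the function names and the `market_price_at_submission` field are hypothetical, and only the 25% threshold and the `entry_price` field come from the text above.

```python
def entry_drift(entry_price: float, market_price: float) -> float:
    """Relative gap between the submitted entry and the market price at submission."""
    if market_price <= 0:
        raise ValueError("market_price must be positive")
    return abs(entry_price - market_price) / market_price


def is_mispriced_entry(entry_price: float, market_price: float,
                       max_drift: float = 0.25) -> bool:
    """Flag a pick whose entry drifts more than 25% from market (MISPRICED_ENTRY)."""
    return entry_drift(entry_price, market_price) > max_drift


def filter_trustworthy(picks: list[dict]) -> list[dict]:
    """Keep only picks that should feed WR/PF aggregates (field names are assumed)."""
    return [
        p for p in picks
        if not is_mispriced_entry(p["entry_price"], p["market_price_at_submission"])
    ]
```

For example, an entry quoted at a pre-split price of 0.40 against a post-split market price of 4.00 drifts by 90% and is flagged, while an entry of 101 against a market price of 100 passes.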
*Score = lower_95%(WR) × lower_95%(PF). It only rewards statistically supported performance. Models need n≥30 resolved picks to earn a rank.
Why two tables, and why a model's WR can differ between them
Leaderboard - Forward-Test Only (above): the vetted ranking. Includes only models with ≥30 trustworthy resolved picks, excludes impossible resolutions (resolved-before-submitted, or TP/SL on the wrong side of entry), and ranks by a CI-adjusted score = lower-95% WR × lower-95% PF (small samples are penalized). Answers "which model is statistically best."
Model Summary - All-Time Performance (below): the breadth view. Lists every model with its win-rate over all resolved picks (WIN/LOSS/EXPIRED). It applies the same impossible-resolution filter as the Leaderboard (so both rest on the same clean cohort), but has no minimum sample and no CI shrinkage. Answers "who is submitting, and how much."
Both tables now use the same clean resolution cohort, so a model's WR matches between them. The Leaderboard additionally requires n≥30 and ranks by the CI-shrunk score, so a strong-looking small-sample model in the Summary may be unranked or ranked lower on the Leaderboard. For "best model" decisions, trust the Leaderboard; treat a Model-Summary WR below n=30 as activity/breadth, not a proven edge. Note: "n picks" includes still-OPEN picks (~half the field) that don't count toward WR until they resolve.
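The CI-adjusted score can be illustrated with the sketch below. The page does not say how the lower-95% bounds are computed, so this example assumes the lower end of a two-sided 95% Wilson interval for WR and a percentile bootstrap for PF; all function names are hypothetical.

```python
import math
import random
from typing import Optional

MIN_RESOLVED = 30  # models need n>=30 resolved picks to earn a rank


def wilson_lower(wins: int, n: int, z: float = 1.96) -> float:
    """Lower end of the two-sided 95% Wilson score interval for a win rate."""
    if n == 0:
        return 0.0
    p = wins / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


def profit_factor(pnls: list[float]) -> float:
    """Gross profit divided by gross loss; inf when there are no losses."""
    gains = sum(x for x in pnls if x > 0)
    losses = -sum(x for x in pnls if x < 0)
    return gains / losses if losses > 0 else math.inf


def bootstrap_pf_lower(pnls: list[float], n_boot: int = 2000, seed: int = 0) -> float:
    """2.5th percentile of PF across bootstrap resamples (assumed method)."""
    rng = random.Random(seed)
    stats = sorted(
        profit_factor(rng.choices(pnls, k=len(pnls))) for _ in range(n_boot)
    )
    return stats[int(0.025 * n_boot)]


def ci_score(wins: int, pnls: list[float]) -> Optional[float]:
    """lower-95% WR x lower-95% PF, or None when the model is below n=30."""
    n = len(pnls)
    if n < MIN_RESOLVED:
        return None
    return wilson_lower(wins, n) * bootstrap_pf_lower(pnls)
```

With 20 wins out of 30, the raw WR is 66.7% but the Wilson lower bound is about 48.8%, which shows how strongly a small sample is shrunk before it can earn a rank.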
Model Summary - All-Time Performance
Model | Picks | WR | PF | Resolved | W/L | Avg PnL | Last Pick (EST) | Personas | Classes | Drill-down
Model Portfolios - Risk-Managed Books
One paper portfolio per (AI-model × risk appetite). Each book runs the model's daily picks through a lifecycle engine (entry/sizing/TP/SL/exit) plus appetite-specific risk caps (gross exposure, per-position weight, drawdown breaker). See docs/DESIGN_AI_MODEL_HEDGE_FUND_PORTFOLIOS_2026-05-29.md.
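As an illustration of how the three appetite-specific caps might interact, here is a minimal sketch. The class, function, and cap values are all hypothetical; the actual rules are in the design doc referenced above, which this example does not reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskCaps:
    max_gross_exposure: float   # sum of |weights| as a fraction of NAV
    max_position_weight: float  # cap on each position's |weight|
    max_drawdown: float         # breaker threshold, e.g. 0.10 = 10% below peak NAV


# Illustrative values only, not the production settings.
CAPS = {
    "conservative": RiskCaps(0.50, 0.05, 0.05),
    "aggressive": RiskCaps(1.50, 0.15, 0.20),
}


def apply_caps(target_weights: dict[str, float], caps: RiskCaps,
               nav: float, peak_nav: float) -> dict[str, float]:
    """Clip target weights to the caps; go flat if the drawdown breaker trips."""
    drawdown = 1.0 - nav / peak_nav
    if drawdown >= caps.max_drawdown:
        return {sym: 0.0 for sym in target_weights}
    # Per-position cap first, preserving the sign (long or short).
    limit = caps.max_position_weight
    clipped = {s: max(-limit, min(limit, w)) for s, w in target_weights.items()}
    # Then scale everything down proportionally if gross exposure is too high.
    gross = sum(abs(w) for w in clipped.values())
    if gross > caps.max_gross_exposure:
        scale = caps.max_gross_exposure / gross
        clipped = {s: w * scale for s, w in clipped.items()}
    return clipped
```

Applying the per-position cap before the gross cap is a design choice in this sketch: it keeps one oversized pick from crowding out the rest of the book when the gross scaling kicks in.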
Portfolio | Provider | NAV | Total Return | CAGR | Sharpe 30d | Sortino 30d | MaxDD | PF | Gross Exp | #Open | Last Mark (EST) | Drill
Model registry (loaded from ai_tournament_model_summary.json)
Model | Provider | Picks | Resolved | Status
Personas: how each AI is asked to think
Every model is assigned 2 to 5 personas per asset class. A persona is a pre-registered strategy lens (entry/exit rules, R:R targets, market-condition fit). The same model can submit different picks under different personas; that's how we measure whether a model is good at, e.g., mean reversion vs breakout trading.
Canonical source: tools/ai_tournament/persona_registry.py + per-model assignments in config/model_persona_mapping.json. Drill into any model on the table above (click model name) to see its per-persona win-rate breakdown.
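The per-persona win-rate breakdown could be computed along these lines. This is a sketch under assumptions: the pick fields `model`, `persona`, and `status` are a guessed shape, not the schema of persona_registry.py or the picks files.

```python
from collections import defaultdict


def win_rate_by_persona(picks: list[dict]) -> dict[tuple[str, str], tuple[float, int]]:
    """Map (model, persona) -> (win_rate, resolved_count), skipping OPEN picks."""
    tally = defaultdict(lambda: [0, 0])  # [wins, resolved]
    for pick in picks:
        if pick["status"] == "OPEN":
            continue  # unresolved picks do not count toward WR
        key = (pick["model"], pick["persona"])
        tally[key][1] += 1
        if pick["status"] == "WIN":
            tally[key][0] += 1
    return {key: (wins / n, n) for key, (wins, n) in tally.items()}
```

EXPIRED picks count as resolved but not as wins here, matching the "WIN/LOSS/EXPIRED" definition of resolved picks used by the Model Summary.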
Personas active in production (23 unique across 7 models, snapshot 2026-05-25)
Originality check: Our personas are strategy archetypes (e.g., momentum_scalp, cta_trend). The FinceptTerminal agent set is built around famous investor names (warren_buffett_agent, benjamin_graham_agent, etc., 11 total). No overlap in naming or design: we describe what the strategy does; they describe who would run it.
Pre-Registered Universe (locked 2026-05-19; models must pick within this list)
All models compete on the same pre-approved symbols. Picks outside the universe are accepted but marked as bonus/unranked. Universe locked to prevent cherry-picking.
Planned expansion: universe will be widened in v1.2 to match the symbols actually traded on /audit per asset class (EQUITY: full S&P 500 + active /audit symbols; CRYPTO: top 100 + active /audit picks; ETF: SPY/QQQ/IWM + the 30 ETFs the system trades; etc.). When universes diverge, the AI-tournament can't be compared apples-to-apples to system performance. To prevent models from taking shortcuts (always picking the same liquid 5 names) we will enforce the /noshortcutsprompt at submission time and allow multiple parallel calls per model+persona so coverage scales with universe size.
Symbol drill-down: on any model's drill-down page you can click a symbol to jump to its row on /audit, so you can compare what the AI is saying about, e.g., NVDA vs what our system's active/closed picks for NVDA show.
EQUITY: S&P 500 constituents (as of 2024-12-31)
CRYPTO: Top 30 by market cap (CoinMarketCap, weekly snapshot)
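The universe rule (out-of-universe picks are accepted but unranked) amounts to a membership check. The sketch below is hypothetical: the function name, the return labels, and the tiny example universe are made up for illustration and do not reflect the locked symbol list.

```python
def classify_pick(symbol: str, asset_class: str,
                  universe: dict[str, frozenset[str]]) -> str:
    """Return "ranked" for in-universe picks and "bonus_unranked" otherwise."""
    allowed = universe.get(asset_class, frozenset())
    return "ranked" if symbol.upper() in allowed else "bonus_unranked"


# Toy universe for demonstration only.
EXAMPLE_UNIVERSE = {
    "EQUITY": frozenset({"NVDA", "AAPL"}),
    "CRYPTO": frozenset({"BTC", "ETH"}),
}
```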
How forward-testing works (success / failure definition)
Default holding window: 7 days (1 week) from submitted_at, unless the pick specifies a longer asset-class window (see windows below).
WIN: current price hits TP before SL within the window.
LOSS: current price hits SL before TP within the window.
FLAT (EXPIRED): neither TP nor SL touched by window close. We mark the pick at the closing price; PnL counts but the pick does not earn a "win" badge.
OPEN picks show a live fwd-test status in the drill-down (toward TP, toward SL, or flat so far) with current unrealized PnL.
Price source: yfinance (equities/ETFs/FX/futures), Binance + CoinGecko fallback (crypto). Resolver runs every ~20 min during sessions; nightly EOD sweep checks all open picks.
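The success/failure definition above can be sketched as a first-touch scan over price bars. This is a simplified illustration, not the resolver itself: the `Bar` type and `resolve_pick` are hypothetical, and when a single bar touches both TP and SL the sketch conservatively resolves to LOSS, which is an assumption made here because bar data alone cannot show which level was hit first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Bar:
    ts: datetime
    high: float
    low: float
    close: float


def resolve_pick(direction: str, tp: float, sl: float, submitted_at: datetime,
                 bars: list[Bar], now: datetime,
                 window: timedelta = timedelta(days=7)) -> tuple[str, Optional[float]]:
    """Return (status, exit_price); status is WIN, LOSS, EXPIRED, or OPEN.

    `bars` must be sorted by time. `direction` is "LONG" or "SHORT".
    """
    deadline = submitted_at + window
    last_close = None
    for bar in bars:
        if bar.ts < submitted_at:
            continue
        if bar.ts > deadline:
            break
        if direction == "LONG":
            hit_tp, hit_sl = bar.high >= tp, bar.low <= sl
        else:
            hit_tp, hit_sl = bar.low <= tp, bar.high >= sl
        if hit_sl:  # checked first: a same-bar double touch counts as LOSS (assumption)
            return "LOSS", sl
        if hit_tp:
            return "WIN", tp
        last_close = bar.close
    if now >= deadline and last_close is not None:
        return "EXPIRED", last_close  # marked at the last close inside the window
    return "OPEN", None
```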
Forward-test only for ranking. Backtests recorded but not used for leaderboard position.
Resolution: the first of TP/SL to be hit decides win/loss. Expiry at closing price (not mid-price). Slippage: 5-10 bps equity, 0.2% crypto.
Hallucination check: All backtest claims reproduced against real OHLC. Fabrication = ⚫ HALLUCINATION flag and a −1 pick penalty.
Max SL/TP ratio: 2.0 (SL ≤ 2×TP). Min RR ≥ 1.5. Enforced at submission.
Model version pinned at tournament start for reproducibility.
No shortcuts prompt applied to every model β see /noshortcutsprompt skill.
Random-guess audit (v1.2, planned): after each submission, the same model is re-asked: "Are these picks based on live market data and news you can cite, or is at least one a random guess? List sources per pick. If you cannot cite a source, mark the pick as speculation." Picks the model self-flags as speculation are still tracked but excluded from the leaderboard score until they resolve.
Confidence tooltip: hover any HIGH/MED/LOW badge on the drill-down to see why the model chose that level (rating, market-supported flag, R:R, timeframe, data source).
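The submission-time limits listed above (SL ≤ 2×TP, RR ≥ 1.5) could be enforced with a check like the one below. This is a sketch under assumptions: it reads "SL" and "TP" as distances from the entry price and RR as TP distance divided by SL distance, and the function name and error strings are hypothetical.

```python
def validate_pick(direction: str, entry: float, tp: float, sl: float,
                  max_sl_tp_ratio: float = 2.0, min_rr: float = 1.5) -> list[str]:
    """Return a list of rule violations; an empty list means the pick passes."""
    if direction not in ("LONG", "SHORT"):
        raise ValueError("direction must be 'LONG' or 'SHORT'")
    sign = 1 if direction == "LONG" else -1
    tp_dist = sign * (tp - entry)  # positive when TP is on the profitable side
    sl_dist = sign * (entry - sl)  # positive when SL is on the losing side
    if tp_dist <= 0 or sl_dist <= 0:
        return ["TP/SL on the wrong side of entry"]
    errors = []
    if sl_dist > max_sl_tp_ratio * tp_dist:
        errors.append("SL distance exceeds 2x TP distance")
    if tp_dist / sl_dist < min_rr:
        errors.append("reward:risk below 1.5")
    return errors
```

Under this reading the RR floor is the binding constraint: any pick with RR ≥ 1.5 already satisfies SL ≤ 2×TP, so the ratio cap only matters if the RR floor is relaxed.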
Tier-rating algorithms per model · per asset class
Each model in the tournament has been asked: "If you were building a 1-10 pick rating system per asset class, what features, weights, data feeds, and refusal floor would you use?" Below we document each respondent's algorithm so we can:
Look for cross-model consensus on which features actually predict edge per class
Identify gaps where our production scoring on /audit is missing features all the models agree on
Spot model-specific biases (e.g., model X overweights momentum on every class, which is useful when reading their picks)
Live data: audit_dashboard/data/tier_rating_algorithms.json (one entry per model · per asset class, schema in tmp/tier_rating_algorithm_prompt.md). Currently a manual capture; auto-population from swarm_runs/tier_rating_* + per-model tournament prompts is on the roadmap (see DAILY_IDEAS.MD 2026-05-25 entry). Sourced answers so far: claude_opus_4_7, deepseek_v4 (2 separate runs), gemini_2_5_pro. Pending: grok3, ring_261T, cursor_agent, cerebras_llama4, qwen3_6_max, mercury_v2, nemotron3_super.
If the tier-rating algorithms do not appear within about 3s, the JSON hasn't been generated yet; see DAILY_IDEAS.MD 2026-05-25 03:00 UTC for the raw responses.
Consensus features (≥2 model agreement, by asset class)
Awaiting more model responses. Current 2-model overlap (Claude + Deepseek):
BOND: yield-curve slope, credit/duration-adjusted spread, real yield
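A consensus list like the one above could be derived by counting how many distinct models name each feature per asset class. The sketch below assumes a simple entry shape (`model`, `asset_class`, `features`); the real schema is defined in tmp/tier_rating_algorithm_prompt.md and may differ. Repeated runs from one model (such as the two deepseek_v4 runs) count once.

```python
from collections import defaultdict


def consensus_features(entries: list[dict], min_models: int = 2) -> dict[str, list[str]]:
    """Map asset class -> sorted features named by at least `min_models` distinct models."""
    backers = defaultdict(lambda: defaultdict(set))  # class -> feature -> {models}
    for entry in entries:
        for feature in entry["features"]:
            backers[entry["asset_class"]][feature].add(entry["model"])
    return {
        asset_class: sorted(f for f, models in feats.items() if len(models) >= min_models)
        for asset_class, feats in backers.items()
    }
```

Tracking a set of model names per feature, instead of a plain counter, is what keeps a model with several runs from manufacturing consensus with itself.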
Model Picks (Forward-Test, Live Tracking)
Phase 1B in progress: model picks will appear here as submissions are collected and validated. Picks are ingested from data/ai_tournament/picks_*.json and updated daily by GitHub Actions.
AI vs Our System - Strategy Comparison
At the end of Phase 1's first resolution cycle, this table will show the best AI model's strategy vs our validated system strategy per class.
Asset class | Best AI model | AI strategy name | Our system strategy | AI WR | Our WR* | AI PF | Our PF*
EQUITY | TBD | TBD | PACP shadow scoring | - | 49.2% | - | 0.76
CRYPTO | TBD | TBD | quan_engine scalp | - | 44.6% | - | 1.25
COMMODITY | TBD | TBD | CTA trend-following + COT | - | 46.9% | - | 1.78
FOREX | TBD | TBD | Mutation protocol (sub-floor) | - | 46.4% | - | 0.27
ETF | TBD | TBD | sector rotation + VIX | - | 55.2% | - | 1.24
BOND | TBD | TBD | UST TSMOM (B-003) | - | 55.6% | - | 1.72
*Our system stats from the live audit dashboard (performance.asset_class_health + money_ready_verdicts), not tournament picks.
Conflict of interest disclosure
This tournament is operated by the same system that uses Claude Code (Anthropic) as its primary agent. Claude Opus 4.7 is one of the competing models. We disclose this conflict and note:
Claude's picks are evaluated by the same automated price-tracking system as all other models.
No human review of individual picks occurs between submission and resolution.
All pick data is stored in the public GitHub repository for independent verification.
The "no shortcuts" prompt is applied identically to every model including Claude.
Disclaimer: This is NOT financial advice. All trading signals, picks, scores, and analysis are for educational and research purposes only. Past performance does not guarantee future results. Trading cryptocurrencies involves substantial risk of loss. Always do your own research (DYOR) before making any investment decisions.