🤖 AI Hedge Fund Simulation: Full Report

May 24, 2026 · 10-round AI debate · 22 forward-test picks · 15+ AI models · Data: tournament_picks (MySQL)
⚠️ HONEST VERDICT: The simulation built real infrastructure but is NOT ready for real money. The confidence engine is inverted (higher confidence = lower win rate), 2 of 9 asset classes have zero resolved data, and only 2 of 22 picks have verified forward-test track records. The debate pipeline itself produced more value than the strategies: it caught the confidence epidemic, duplicate entries, and n-threshold gaps.

1. Methodology

Data Source

All picks are queried from the tournament_picks table (MySQL: ejaguiar1_stocks, 3,149 rows, 34 AI models, 9 asset classes). Only forward-test OPEN picks are used. Synthetic/backtest data (SYNTHETIC_SEED_ENRICHED, BACKTEST_VERIFIED) and KILLED personas (ml_pattern, relative_strength, dividend_compound) are excluded. FOREX is blocked by the kill gate (57.3% WR, -0.39% avg PnL).

Confidence Assignment

Method | Description | Picks
PERSONA WR | When persona has n≥20 resolved picks, confidence = persona win rate | 6 picks (PG 64%, SOL 65%, TLT/SPY/GLD/SHY 62.5%)
MODEL-REPORTED | Pick comes from a model that reported its own confidence (HIGH/MEDIUM/LOW string or 0-1 float) | 7 picks (MSFT 30%, XOM 30%, CL=F 80%)
IMPUTED | No WR and no model-reported confidence; coin-flip assumption | 9 picks (SI=F 50%, PENNY/FUTURES 0%)
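
The three methods above can be read as a fallback chain. The following is a minimal sketch of that reading, not the pipeline's actual code: the function name, the precedence order (taken from the row order of the table), and the numeric values for the HIGH/MEDIUM/LOW labels are all assumptions. It also returns 0.5 for every imputed pick and does not reproduce the 0% shown for PENNY/FUTURES.

```python
MIN_RESOLVED = 20  # n >= 20 threshold from the PERSONA WR row

# Hypothetical label-to-number mapping; the report does not state these values.
LABEL_TO_CONF = {"HIGH": 0.8, "MEDIUM": 0.5, "LOW": 0.3}


def assign_confidence(persona_wr, persona_n, model_conf=None):
    """Return (confidence, method) for one pick.

    persona_wr: persona win rate as a 0-1 float, or None if unknown
    persona_n:  number of resolved picks for that persona
    model_conf: None, a 0-1 float, or a HIGH/MEDIUM/LOW string
    """
    # 1. Enough resolved history: use the persona's realized win rate.
    if persona_wr is not None and persona_n >= MIN_RESOLVED:
        return persona_wr, "PERSONA WR"
    # 2. Otherwise fall back to whatever the model reported.
    if model_conf is not None:
        if isinstance(model_conf, str):
            return LABEL_TO_CONF[model_conf.upper()], "MODEL-REPORTED"
        return float(model_conf), "MODEL-REPORTED"
    # 3. Nothing to go on: coin-flip assumption.
    return 0.5, "IMPUTED"
```

For example, `assign_confidence(0.64, 164)` yields the PERSONA WR branch with confidence 0.64, matching the PG entry in the table.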

Ranking Formula

Composite Score = Confidence × WR × RR × ln(n+1), normalized per asset class. Picks with n=0 score zero by definition, because ln(1) = 0.
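
A minimal sketch of the ranking formula follows. The report does not say how scores are normalized per asset class, so dividing by the best raw score in each class is an assumption here, as are the function names and the dictionary keys.

```python
import math


def raw_score(conf, wr, rr, n):
    """Confidence x WR x RR x ln(n+1); zero when n == 0 because ln(1) == 0."""
    return conf * wr * rr * math.log(n + 1)


def composite_scores(picks):
    """Return {ticker: normalized score}.

    picks: list of dicts with keys ticker, asset_class, conf, wr, rr, n.
    Normalization (assumed): divide by the maximum raw score in the class.
    """
    raw = {p["ticker"]: raw_score(p["conf"], p["wr"], p["rr"], p["n"]) for p in picks}
    class_max = {}
    for p in picks:
        cls = p["asset_class"]
        class_max[cls] = max(class_max.get(cls, 0.0), raw[p["ticker"]])
    scores = {}
    for p in picks:
        top = class_max[p["asset_class"]]
        # A class where every pick has n == 0 stays at zero instead of dividing by zero.
        scores[p["ticker"]] = raw[p["ticker"]] / top if top > 0 else 0.0
    return scores
```

Under this normalization the best pick in each class scores 1.0, so scores are comparable within a class but not across classes.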

Debate Process

Round 1 (7 models): Risk Manager, Portfolio Manager, Cerebras GPT-OSS-120B, DeepSeek V4, KiloCode, Kimi, and Cursor debated which picks are safest and which to veto. Produced a consensus top 5 and a systemic issue list.

Round 2 (2 models): Multi-Asset Allocator + Financial Data Architect. Expanded to all 9 asset classes and identified IPO/mutual fund infrastructure gaps.

Round 3 (3 models): Quant Researcher (EV/Sharpe/Kelly) + Behavioral Analyst (market narrative) + Hedge Fund PM ($500k AUM allocation). Produced 8-position risk-parity portfolio.

Round 4: IPO Lockup Expiry Strategy. SHORT 30 days before 180-day lockup expiry. Currently data-starved (needs live SEC EDGAR scraper).

Rounds 5-14 (10 agents): Pick-by-pick review, cross-round pattern analysis, devil's advocate audit, entry criteria standardization, data gap ranking, statistical edge recalculation, model attribution scorecard, final executive synthesis.

2. Results Per Asset Class

EQUITY 3 TOP PICKS

Rank | Pick | Direction | Entry | WR | n | RR | Conf | Status
1 | META | LONG | $620.71 | 60% | 124 | 1.7 | 50% | PROVEN
2 | PG | SHORT | $167.37 | 64% | 164 | 1.5 | 64% | VERIFIED
3 | GOOGL | LONG | $186.63 | 60% | 124 | 2.5 | 30% | ESTIMATED

Best equity edge: PG SHORT, the only verified-WR equity pick (64%, n=164). META LONG is the AI monetization flywheel play. GOOGL has the highest RR (2.5) but low confidence (30%). Gap: the UEPS fundamental screen ranks ADBE (Score 0.839) as theoretically the highest-quality name, but ADBE has 0 forward-test data in tournament_picks.

CRYPTO 2 TOP PICKS

Rank | Pick | Direction | Entry | WR | n | RR | Conf | Status
1 | SOLUSDT | LONG | $157.39 | 65% | 23 | 2.1 | 65% | VERIFIED
2 | AVAXUSDT | SHORT | $22.83 | 65% | 23 | 1.8 | 65% | SMALL N

SOLUSDT LONG is the only crypto pick with verified WR (65%, n=23). The vol_arb persona has only 23 resolved picks, which is statistically insufficient (95% CI: ±20%). Risk: Crypto shorts (AVAX, BTC, ETH) conflict directionally with the SOL long; if risk-on returns, all shorts get run over.
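
The ±20% figure is consistent with a normal-approximation (Wald) interval for a win rate. The report does not say which interval it used, so the sketch below is one plausible reconstruction; the function name is hypothetical.

```python
import math


def wald_half_width(wr, n, z=1.96):
    """Half-width of the normal-approximation CI for a win rate.

    wr: observed win rate as a 0-1 float
    n:  number of resolved picks
    z:  1.96 corresponds to a two-sided 95% interval
    """
    return z * math.sqrt(wr * (1 - wr) / n)


# vol_arb persona: 65% WR over 23 resolved picks
print(round(wald_half_width(0.65, 23), 3))    # 0.195, i.e. roughly +/-20 points
# risk_parity persona: 62.5% WR over 124 resolved picks
print(round(wald_half_width(0.625, 124), 3))  # 0.085
```

At n=23 the interval runs from about 45% to 85%, so a 65% win rate cannot be distinguished from a coin flip. The Wald interval is itself unreliable at small n; a Wilson interval would be a more conservative choice.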

ETF 2 TOP PICKS

💡 ELI5: SPY is the whole US stock market. Shorting it means you think stocks will go DOWN. GLD is gold, which people buy when they're scared about inflation or war. Together they're betting that stocks fall and gold rises, which is the "stagflation" playbook.

Rank | Pick | Direction | Entry | WR | n | RR | Conf
1 | SPY | SHORT | $726.80 | 62.5% | 124 | 1.9 | 62.5%
2 | GLD | LONG | $257.93 | 62.5% | 124 | 1.2 | 62.5%

SPY SHORT + GLD LONG = textbook stagflation pair. Both are anchored by the risk_parity persona (n=124). SPY SHORT has the highest composite score in the entire book. GLD has the weakest RR (1.2) but adds diversification value.

BOND 3 PICKS

Rank | Pick | Direction | Entry | WR | n | RR | Conf
1 | TLT | LONG | $87.66 | 62.5% | 124 | 1.8 | 62.5%
2 | SHY | SHORT | $82.36 | 62.5% | 124 | 1.6 | 62.5%

TLT LONG is the strongest consensus pick (7/7 models). SHY SHORT + TLT LONG = curve steepener. Risk: Bond Sharpe ratios inflated by low vol assumption (0.5% daily). Real bond strategies do not sustain Sharpe 19.

COMMODITY 3 PICKS

RankPickDirectionEntryWRnRRConf
1SI=FSHORT$34.9362.5%1241.450%
2CL=FSHORT$68.2565%231.180%

CL=F SHORT has near-zero EV (0.05 risk units) and RR=1.1, so risk/reward is barely above breakeven. Gap: the COMMODITY pipeline may be broken, since top systems (multi_asset_cot, PF=4.72) have n=0 in the resolved DB. All commodity picks are SHORT, so check for regime bias. SI=F SHORT replaced CL=F in the adjusted portfolio.

PENNY & FUTURES BLOCKED

WR=0%, n=0 for all 6 picks. MVST, KULR, QBTS (penny stocks) and ES=F, GC=F, CL=F (futures) have zero resolved data, so there is no statistical basis for inclusion. These should be removed from the active pick list until n≥50 resolved trades. Currently listed as PAPER ONLY: no real capital allocation.
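
A minimal sketch of such an n-threshold gate, assuming each pick is a dict carrying its resolved-trade count under a hypothetical key `n`:

```python
MIN_RESOLVED_TRADES = 50  # threshold proposed in this report


def split_by_n_gate(picks, min_n=MIN_RESOLVED_TRADES):
    """Split picks into (active, paper_only) by resolved-trade count.

    picks: list of dicts, each with at least the keys "ticker" and "n".
    Picks below the threshold are kept for paper tracking, not discarded.
    """
    active = [p for p in picks if p["n"] >= min_n]
    paper_only = [p for p in picks if p["n"] < min_n]
    return active, paper_only
```

Note that at a threshold of 50 the gate would also move the n=23 crypto and CL=F picks to paper-only, not just the n=0 penny and futures entries.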

FOREX BLOCKED

Kill gate active: 57.3% WR, -0.39% avg PnL, 253 resolved picks. Statistical trap confirmed: many small wins, occasional large losers (3.2:1 loss-to-win ratio). 63% of FOREX wins are 1-basis-point "resolver flicker." Zero allocation until asymmetric TP/SL fix is validated.
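
The arithmetic behind the trap can be sketched as follows. It assumes "3.2:1 loss-to-win ratio" means the average loss is 3.2 times the average win, which the report does not define, and it measures expectancy in units of the average win, so it is not expected to reproduce the -0.39% avg PnL figure directly.

```python
def expectancy(wr, loss_to_win):
    """Expected PnL per trade, in units of the average win.

    wr:          win rate as a 0-1 float
    loss_to_win: average loss divided by average win (assumed meaning of 3.2:1)
    """
    return wr * 1.0 - (1.0 - wr) * loss_to_win


def breakeven_win_rate(loss_to_win):
    """Win rate at which expectancy is exactly zero."""
    return loss_to_win / (1.0 + loss_to_win)


print(round(expectancy(0.573, 3.2), 4))    # -0.7934
print(round(breakeven_win_rate(3.2), 4))   # 0.7619
```

Under that reading a 57.3% win rate sits well below the roughly 76% needed to break even, which is why a majority of winning trades still produces a negative average PnL.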

3. What's Broken (P0: Must Fix Before Real Money)

# | Issue | Severity | Fix
1 | ML confidence INVERTED: 0.85-0.90 band has 20% WR | CRITICAL | Flip scoring: high confidence → sell signal. Recalibrate against realized outcomes.
2 | No n-threshold gate: 0-data personas generating live signals | CRITICAL | Require ≥50 resolved trades per source before any signal passes.
3 | PENNY/FUTURES with 0 resolved data | CRITICAL | Drop from active pick list. Paper-track until n≥50.
4 | CL=F duplicated at 2 prices ($68.25 / $73) | HIGH | Keep commodity entry, drop futures entry.
5 | FOREX resolver bug: 63% of wins are 1bp "flicker" | HIGH | Already blocked by kill gate. Fix asymmetric TP/SL before re-enabling.
6 | COMMODITY pipeline broken: top systems have n=0 in resolved DB | HIGH | Investigate table mismatch. multi_asset_cot PF=4.72 invisible to OOS validator.
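
Issue 1 is the kind of problem a calibration table exposes: group resolved picks into confidence bands and compare each band's realized win rate with its stated confidence. The sketch below assumes resolved picks are available as (confidence, won) pairs; the function name and the 5-point band width are illustrative choices, not the project's own.

```python
from collections import defaultdict


def win_rate_by_band(resolved, band_pct=5):
    """Return {band_start_pct: (win_rate, count)} for resolved picks.

    resolved: iterable of (confidence, won) pairs, confidence as a 0-1 float
              and won as a bool.
    band_pct: band width in percentage points; 5 gives bands such as 85-90.
    """
    tallies = defaultdict(lambda: [0, 0])  # band start -> [wins, total]
    for conf, won in resolved:
        # Work in whole percent to avoid float edge cases at band boundaries;
        # a confidence of exactly 1.0 is folded into the top band.
        pct = min(int(conf * 100 + 1e-9), 99)
        band = pct // band_pct * band_pct
        tallies[band][0] += 1 if won else 0
        tallies[band][1] += 1
    return {b: (w / t, t) for b, (w, t) in sorted(tallies.items())}
```

A well-calibrated engine shows win rates that rise with the band start. A band such as 85 returning a win rate near 0.20 is the inversion described in the table, and the count in each tuple shows whether the band has enough picks to trust.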

4. Areas for Improvement

findtorontoevents.ca/audit/

findtorontoevents.ca/audit/ai-tournament.html

Overall System

5. 1-Week Prediction

Resolver fires May 30 at 23:00 UTC.

Pick | Direction | Prediction | Confidence | Rationale
TLT LONG | BULLISH | WIN | HIGH | 7/7 model consensus. Cleanest macro expression.
SPY SHORT | BEARISH | WIN | HIGH | Foundation of risk-off book. Market correction continuing.
GLD LONG | BULLISH | WIN | HIGH | Stagflation hedge thesis intact.
PG SHORT | BEARISH | WIN | MEDIUM | Only verified-WR equity pick (64%).
CL=F SHORT | BEARISH | LOSS | LOW | Near-zero EV. Oil geopolitical risk.
SHY SHORT | BEARISH | LOSS | LOW | Coin flip. Rate cut priced in.

6. Key Files

File | Content
reports/AI_HEDGE_FUND_SIMULATION_EXECUTIVE_SUMMARY_2026-05-24.md | Complete 159-line executive summary
audit_dashboard/hedge_fund_simulation_20260524.html | 3-round debate results (7 models, per-agent insights)
audit_dashboard/curated_picks_20260524.html | Top 3 picks per asset class
updates/2026-05-24-cross-asset-statistical-analysis.md | Per-pick EV/Sharpe/Kelly, correlation matrix
reports/CONFIDENCE_METHODOLOGY_2026-05-24.md | 3 confidence methods, thresholds, calibration gaps

โš ๏ธ Not financial advice. Educational/research simulation only. Zero real money deployed. AI Tournament ยท Curated Picks ยท Updates