Table of Contents

  1. Executive Summary (TL;DR)
  2. Our Ground Truth: Pre-Registered OOS Results
  3. MiniMax — Wrong then Self-Corrected to Verified
  4. Kimi — Admitted Synthetic Data, Useful Code Contribution
  5. HC Filter Root Cause: Why 0 Passes on Active Picks
  6. Scoreboard: Who Got What Right
  7. New Finding: Per-Strategy Breakdown in aggregated_picks
  8. Recommended Filters (OOS-Validated Only)
  9. Action Items

1. Executive Summary (TL;DR)

Bottom line: We ran two external AI peer reviews (Kimi and MiniMax) against our pre-registered out-of-sample (OOS) data. Both made major errors initially. MiniMax self-corrected, and all of its corrected numbers are now verified. Kimi admitted that all of its data was synthetic (fake): its code engine is real and reusable, its performance claims are not. The single most actionable finding: two source systems (kimi_signal_tracking and aggregated_picks) dominate all real edge (WR 76-78%, PF 6.9-7.7). Everything else is noise.

AI Reviewer | Initial Verdict | After Correction | Most Valuable Finding
MiniMax | Wrong dataset (55,510 picks) | All corrected claims VERIFIED ✅ | Per-strategy breakdown inside aggregated_picks (AuditEnsemble_LONG: WR=94.2%)
Kimi | CRYPTO PF=16.57 (synthetic) | Honest: "all data was fake" | backtest.py: PurgedKFold, PBO, DSR, WFE, Kelly (production-quality code)
Claude Code (us) | OOS verifier | Ground truth | HC filter trust_score gap: root cause of 0 passes on active picks

2. Our Ground Truth: Pre-Registered OOS Results

All claims in this session were verified against audit_trail/data/universal_resolved_picks.json — a pre-registered out-of-sample split with 5,000 picks, cutoff 2026-04-01, bootstrap CI 5,000 iterations (seed=42). Win condition: pnl_pct > 0.
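
The metrics used throughout this report (win rate, profit factor, bootstrap lower bound) can be sketched as follows. This is a minimal illustration, not the verifier's actual code: the function names are hypothetical, the percentile bootstrap and the 95% two-sided level are assumptions (the report states only 5,000 iterations and seed=42), and the demo values are made up rather than read from universal_resolved_picks.json.

```python
import random

def win_rate(pnls):
    """Share of picks with pnl_pct > 0 (the win condition stated above)."""
    return sum(1 for p in pnls if p > 0) / len(pnls)

def profit_factor(pnls):
    """Gross profit divided by gross loss; inf when there are no losses."""
    gains = sum(p for p in pnls if p > 0)
    losses = -sum(p for p in pnls if p < 0)
    return float("inf") if losses == 0 else gains / losses

def bootstrap_ci_lower(pnls, stat, iterations=5000, seed=42, alpha=0.05):
    """Percentile-bootstrap lower bound of `stat` (assumed 95% two-sided)."""
    rng = random.Random(seed)  # fixed seed makes the bound reproducible
    n = len(pnls)
    samples = sorted(
        stat([pnls[rng.randrange(n)] for _ in range(n)])
        for _ in range(iterations)
    )
    return samples[int((alpha / 2) * iterations)]

# Illustrative pnl_pct values, NOT taken from the OOS file.
demo = [2.0, 3.0, -1.0, 4.0, -2.0, 1.0, 5.0, -1.0]
print(win_rate(demo), profit_factor(demo))
print(bootstrap_ci_lower(demo, profit_factor))
```

A "CI Lower" value above 1.0 means that even the pessimistic end of the resampled profit factor is still profitable, which is one plausible reading of why the tiers below key off that column.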

Source System | n (OOS) | Win Rate | Profit Factor | CI Lower | Tier
kimi_signal_tracking | 368 | 76.6% | 7.70 | 5.84 | TIER 1
aggregated_picks | 385 | 77.9% | 6.94 | 5.71 | TIER 1
stocks_competition | 53 | 67.9% | 3.71 | 2.28 | TIER 1 (AC1 warning)
signal_validation | 291 | 50.2% | 1.95 | 1.41 | TIER 2
alpha_engine | - | 31.0% | 0.81 | <1.0 | AVOID
ml_crypto_pred / ml_enhanced | 297 | 33-35% | 0.82 | <1.0 | AVOID
Direction: LONG | - | 47.9% | 1.72 | - | FAVOR
Direction: SHORT | - | 30.6% | 0.90 | <1.0 | BLOCK
Aggregate system performance ≠ elite system performance. Kimi analyzed the aggregate (all 130+ systems: overall WR=40%, PF=1.02). The aggregate looks poor because losing systems (alpha_engine WR=31%, ml_crypto_pred WR=35%) drag down the mean. The elite systems (kimi_signal_tracking, aggregated_picks) show WR=76-78% — but they represent only ~15% of all picks by volume. Source system filtering is the most important gate.
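
The dilution effect described above can be reproduced with a per-source breakdown. The sketch below assumes each pick is a dict with source_system and pnl_pct fields; the function name and the toy data are hypothetical and chosen only to show how a high-volume weak source pulls the pooled win rate down.

```python
from collections import defaultdict

def win_rate_by_source(picks):
    """Return ({source_system: win rate}, pooled win rate) for a list of picks."""
    wins, totals = defaultdict(int), defaultdict(int)
    for pick in picks:
        totals[pick["source_system"]] += 1
        wins[pick["source_system"]] += pick["pnl_pct"] > 0  # bool counts as 0/1
    per_source = {s: wins[s] / totals[s] for s in totals}
    pooled = sum(wins.values()) / sum(totals.values())
    return per_source, pooled

# Toy data (hypothetical): one strong low-volume source, one weak high-volume source.
picks = (
    [{"source_system": "elite", "pnl_pct": 1.0}] * 3
    + [{"source_system": "elite", "pnl_pct": -1.0}] * 1
    + [{"source_system": "bulk", "pnl_pct": 1.0}] * 6
    + [{"source_system": "bulk", "pnl_pct": -1.0}] * 14
)
per_source, pooled = win_rate_by_source(picks)
print(per_source)  # {'elite': 0.75, 'bulk': 0.3}
print(pooled)      # 0.375
```

The pooled figure sits close to the weak source because that source contributes most of the volume, which is the same mechanism that hides the elite systems in the aggregate view.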

3. MiniMax — Wrong then Self-Corrected to Verified

MiniMax
Round 1: Wrong dataset  →  Round 2: All corrected claims VERIFIED

Round 1 — Major Errors (using live dashboard, 55,510 picks)

MiniMax Claim | OOS Reality | Verdict
ml_enhanced CRYPTO: 95-100% WR (INJ, FET, DYDX) | ml_enhanced OOS: WR=33.2%, PF=0.82 (losing) | FABRICATED
COMMODITY cot_positioning: WR=89.8%, PF=13.1, n=49 | COMMODITY in OOS: n=0 (COT timing leakage) | UNVERIFIABLE
ETF: WR=57.4%, n=108 | ETF in OOS: n=0 (not yet through resolver) | WRONG SOURCE
$150,000 capital allocation | No statistical basis (fabricated) | DO NOT USE
FOREX blocked | FOREX OOS: WR=29.4%, PF<1 (confirmed sub-floor) | CORRECT

Round 2 — Self-Corrected Claims (all verified ✅)

MiniMax Corrected Claim | Our OOS Data | Verdict
aggregated_picks: n=383, WR=78.1%, PF=7.02 | n=385, WR=77.9%, PF=6.94 | VERIFIED ✅
kimi_signal_tracking: n=354, WR=76.8%, PF=7.68 | n=368, WR=76.6%, PF=7.70 | VERIFIED ✅
signal_validation: WR=50.2% | n=291, WR=50.2%, PF=1.95 | VERIFIED ✅
stocks_competition: WR=67.9%, PF=3.71 | n=53, WR=67.9%, PF=3.71 | EXACT MATCH ✅
alpha_engine: WR=31%, AVOID | WR=31.0%, PF=0.81 | VERIFIED ✅
ml_crypto_pred: WR=35.1%, AVOID | WR=35.1%, PF=0.82 | EXACT MATCH ✅
LONG: WR=47.7%, PF=1.70 | WR=47.9%, PF=1.72 | VERIFIED ✅
SHORT: WR=30.5%, PF=0.89, AVOID | WR=30.6%, PF=0.90 | VERIFIED ✅

MiniMax chat reference: agent.minimax.io/share/398788923621504

4. Kimi — Admitted Synthetic Data, Useful Code Contribution

Kimi
All performance claims: SYNTHETIC (fake) data  |  Code engine: production-quality
Kimi admitted (2026-05-16, 2:43am EST): "My dashboard was based on SYNTHETIC GBM-calibrated data. NOT real market data. My T1/T2 claims are NOT based on your real data."

What Kimi Claimed vs Reality

Kimi Claimed PF | Our Real PF (dashboard_data.json) | Overstatement
CRYPTO: PF=16.57 | PF=1.30 | 12.7x overstated
EQUITY: PF=2.62 | PF=1.55 | 1.7x overstated
FOREX: PF=2.09 | PF=0.86 (losing) | 2.4x overstated (actually sub-floor)
COMMODITY: PF=2.47 | PF=2.48 (but COT artifact) | Coincidentally close (n=0 in OOS)
ETF: PF=1.93 | PF=2.25 (dashboard) / n=0 (OOS) | Actually understated: ETF is better
BOND: PF=1.89 | PF=0.66 | 2.9x overstated (actually sub-floor)
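
The overstatement column is simply the claimed profit factor divided by the real one. A short sketch of that arithmetic, using the claimed and real values from the table above (the function name and the flag wording are illustrative):

```python
# (claimed PF, real PF) pairs copied from the table above.
claims = {
    "CRYPTO": (16.57, 1.30),
    "EQUITY": (2.62, 1.55),
    "FOREX": (2.09, 0.86),
    "COMMODITY": (2.47, 2.48),
    "ETF": (1.93, 2.25),
    "BOND": (1.89, 0.66),
}

def overstatement(claimed, real):
    """Ratio of claimed to real profit factor; > 1 means the claim was inflated."""
    return claimed / real

for asset, (claimed, real) in claims.items():
    ratio = overstatement(claimed, real)
    flag = "real PF below 1.0" if real < 1.0 else "real PF at or above 1.0"
    print(f"{asset}: {ratio:.2f}x ({flag})")
```

A ratio below 1 (ETF here) means the synthetic claim was lower than the dashboard figure, so it reads as an understatement rather than an overstatement.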

What to Keep from Kimi

Component | Status | Value
edge_engine/backtest.py | REAL CODE | PurgedKFold, PBO, DSR, WFE, Kelly sizing: production-quality validation tooling
edge_engine/signal_generator.py | FRAMEWORK | Signal generation framework; needs wiring to our DB
dashboard/*.html (4 files) | UI SHELL | Dark-themed React+Chart.js dashboard; adapt fetch() URL to our data
dashboard/css/styles.css | REAL CSS | Production dark-theme styling, drop-in reusable
edge_configs/*.json (6 files) | DISCARD | Based on synthetic backtests: misleading, delete immediately
data/01_raw/*.parquet (6 files, ~83 MB) | DISCARD | Synthetic OHLCV generated by GBM, not real market data

Kimi's Honest Assessment of Our Real Data

Using our actual dashboard_data.json, Kimi gave an honest breakdown:

Note: Kimi's "CRYPTO PF=1.30" reflects aggregate system performance. Elite systems inside CRYPTO (kimi_signal_tracking WR=76.6%, aggregated_picks WR=77.9%) are hidden by the aggregate view. Source system filtering unlocks the real edge.

5. HC Filter Root Cause: Why 0 Passes on Active Picks

Root cause found: The trust_score field is absent from all 135 active picks in alpha_engine/data/active_picks.json. Gate 7 of the HC filter unconditionally requires trust_score >= 6, and a missing value defaults to 0.0, so every pick fails the gate. Result: the HC filter always returns 0 passes, even for genuinely strong picks.
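
The failure mode is easy to reproduce in isolation. The sketch below is written in Python for consistency with the other examples in this report, although the filter itself lives in audit_dashboard/hc_filter.js; both function names and the sample pick are hypothetical, and the second function is only one possible way to make the missing field visible.

```python
def gate7_current(pick):
    """Mirrors the described behavior: a missing trust_score defaults to 0.0."""
    return pick.get("trust_score", 0.0) >= 6

def gate7_explicit(pick):
    """Alternative sketch: report 'unknown' instead of silently failing."""
    score = pick.get("trust_score")
    if score is None:
        return None  # caller decides: skip the gate, enrich, or reject
    return score >= 6

pick = {"symbol": "BTC", "confidence": 0.72}  # hypothetical pick without trust_score
print(gate7_current(pick))   # False
print(gate7_explicit(pick))  # None
```

Returning a distinct "unknown" value would have surfaced the missing enrichment as a data problem instead of as 135 silent rejections.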

HC Filter Gate Verdicts (from OOS backtest)

The HC filter backtest agent ran against our 5,000-pick OOS dataset. Key findings:

Gate | Verdict | Notes
G1: score ≥ 40 | WORKING | elite_score < 40 collapses to WR=11.1%
G3: trust_tier blacklist | LIKELY WORKING | Source system proxy confirms: mutation_lab, battleground show WR 0-10%
G6: per-class score floors | PARTIAL | Floor meaningful, upper bands flat; CRYPTO 0.80-0.90 conf is actual danger zone
G8: confidence > 0.90 blocked | MISSPECIFIED | 0.80-0.90 is the actual danger zone (PF=0.96); >0.90 is better than 0.80-0.90
G5: fwd_wr gate | CANNOT VALIDATE | strat_fwd_wr absent from OOS export
G7: trust_score ≥ 6 | CANNOT VALIDATE | trust_score absent from OOS export AND from active_picks.json
G9: regime/DSR gate | CANNOT VALIDATE | regime fields absent from OOS export

Compound Filter That Actually Works (OOS-Validated)

Filter | N picks | Win Rate | Profit Factor
All OOS picks (no filter) | 5,000 | 43.5% | 1.48
Elite sources only (kimi + aggregated) | 1,097 | 69.6% | 4.94
Elite sources + confidence > 0.65 | 367 | 78.5% | 7.07
Key insight: Confidence alone adds 0 percentage points to WR. Source system selection adds +26pp. Confidence > 0.65 then adds another ~9pp within elite sources. This is the correct order of operations.

Fix Required (P0)

The trust_score field must be populated in active_picks.json before the full HC filter can operate. Until then, use the proxy 4-gate filter:

source_system IN [kimi_signal_tracking, aggregated_picks, stocks_competition] AND confidence >= 0.65 AND risk_reward >= 1.5 AND direction = LONG (CRYPTO)
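
As a rough illustration, the proxy filter above can be written as a single predicate. This sketch assumes picks are dicts carrying the four field names used in the expression; the function name is hypothetical, missing fields are treated as failures (a design choice, not something the report specifies), and the LONG requirement is applied to every pick, which is the stricter reading of the "(CRYPTO)" qualifier.

```python
ELITE_SOURCES = {"kimi_signal_tracking", "aggregated_picks", "stocks_competition"}

def passes_proxy_filter(pick):
    """Proxy 4-gate filter; any missing field fails closed."""
    return (
        pick.get("source_system") in ELITE_SOURCES
        and pick.get("confidence", 0.0) >= 0.65
        and pick.get("risk_reward", 0.0) >= 1.5
        and pick.get("direction") == "LONG"
    )

# Hypothetical picks for illustration only.
good = {"source_system": "aggregated_picks", "confidence": 0.70,
        "risk_reward": 2.0, "direction": "LONG"}
short = dict(good, direction="SHORT")
print(passes_proxy_filter(good), passes_proxy_filter(short))  # True False
```

Unlike Gate 7, none of these four gates depends on trust_score, so the predicate can run against active_picks.json as it exists today.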

6. Scoreboard: Who Got What Right

Claim / Topic | MiniMax | Kimi | Truth
Top CRYPTO systems | Initial: wrong. Corrected: aggregated_picks ✅ | Missed (analyzed aggregate only) | kimi_signal_tracking + aggregated_picks (WR 76-78%)
FOREX performance | CORRECT: blocked ✅ | CORRECT: PF=0.86 sub-floor ✅ | PF=0.86, WR=29.4% (confirmed loser)
COMMODITY | Fabricated n=345 (OOS n=0) | Correct PF=2.48 but artifact warning ✅ | COT timing leakage inflates; n=0 in OOS
ETF | n=108 from wrong source | PF=1.33 understated (actual 2.25) ⚠️ | Dashboard: PF=2.25, WR=66.7% (better than both claimed)
ml_enhanced / ml_crypto_pred | Initial: 95-100% WR (wrong). Corrected: 35.1% ✅ | Not analyzed | WR=33-35%, PF=0.82 (losing system)
SHORT direction | CORRECT: SHORT=30.6%, AVOID ✅ | Partial (mentioned FOREX SHORT only) | All CRYPTO SHORT: WR=30.3%, PF=0.90 (net negative)
Capital allocation ($150K) | FABRICATED: do not use | Not proposed | Use OOS bootstrap: 0.5-0.75% per pick max for Tier 1
Performance claims (PF/WR numbers) | After correction: all within rounding ✅ | All synthetic: discard | OOS is the canonical source
Code / tooling contribution | EXACT_FILTERS_FOR_UI.md (KIMI code refs verified) | backtest.py (PBO, DSR, WFE): real production code ✅ | Both contributed usable tooling

7. New Finding: Per-Strategy Breakdown in aggregated_picks

MiniMax's self-corrected analysis revealed per-strategy performance within aggregated_picks. All verified against OOS data. This is the most actionable new finding from this session.

Strategy | n | Win Rate | Profit Factor | Action
VWAP Deviation Scalp | 35 | 97.1% | 119.0 | TARGET: near-perfect (n thin)
AuditEnsemble_LONG | 104 | 94.2% | 37.79 | TARGET: best n>100 strategy
Multi-Timeframe Trend Alignment | 76 | 90.8% | 21.2 | TARGET: strong n=76
RSI Divergence Scalp | 24 | 83.3% | 9.86 | WATCH
EMA Ribbon Momentum Pullback | 20 | 75.0% | 5.25 | WATCH
CCI Reversal Scout | 21 | 66.7% | 5.44 | WATCH
incubator_gainer | 22 | 50.0% | 1.84 | MARGINAL
Bollinger Band Squeeze Breakout | 19 | 42.1% | 1.31 | BELOW FLOOR
Recommendation: Add strategy IN ['AuditEnsemble_LONG', 'Multi-Timeframe Trend Alignment', 'VWAP Deviation Scalp'] as a positive filter to tools/weekly_filter_picks.py. These three strategies drive virtually all of aggregated_picks' edge. Adding strategy-level filtering would materially improve pick quality beyond just filtering on source_system.
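
One way the strategy allowlist could look in tools/weekly_filter_picks.py is sketched below. The existing contents of that file are not shown in this report, so the function name and the pick layout are assumptions; the dict values are the OOS win rates from the table above and are carried only for reference.

```python
# OOS win rates from the per-strategy table, kept alongside the allowlist.
ELITE_STRATEGIES = {
    "AuditEnsemble_LONG": 0.942,
    "Multi-Timeframe Trend Alignment": 0.908,
    "VWAP Deviation Scalp": 0.971,
}

def keep_pick(pick):
    """Apply the strategy allowlist only to aggregated_picks; pass others through."""
    if pick.get("source_system") != "aggregated_picks":
        return True  # strategy breakdown was only verified inside aggregated_picks
    return pick.get("strategy") in ELITE_STRATEGIES

# Hypothetical picks for illustration.
picks = [
    {"source_system": "aggregated_picks", "strategy": "AuditEnsemble_LONG"},
    {"source_system": "aggregated_picks", "strategy": "Bollinger Band Squeeze Breakout"},
    {"source_system": "kimi_signal_tracking", "strategy": None},
]
kept = [p for p in picks if keep_pick(p)]
print(len(kept))  # 2
```

Scoping the allowlist to aggregated_picks avoids accidentally dropping kimi_signal_tracking picks, whose strategy labels were not part of this breakdown.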

8. Recommended Filters (OOS-Validated Only)

Real-Money Filter (use these)

Tier 1 — Primary Allocation (0.5-0.75% per pick)

1. source_system IN [aggregated_picks, kimi_signal_tracking] → WR 77-78%, PF 6-7
2. strategy IN [AuditEnsemble_LONG, Multi-Timeframe Trend Alignment, VWAP Deviation Scalp] (within aggregated_picks)
3. direction = LONG only (CRYPTO SHORT: WR=30.3%, so block)
4. confidence >= 0.65 (adds +9pp WR within elite sources)
5. confidence < 0.80 or > 0.90 (avoid 0.80-0.90 danger zone)
6. risk_reward >= 1.5 (RR > 2.0-2.5 collapses to WR=7.3%, so avoid aggressive TP)

Tier 2 — Smaller Allocation (0.25% per pick)

1. source_system IN [signal_validation, stocks_competition] + direction = LONG
2. confidence >= 0.65

AVOID (block even if other gates pass)

source_system IN [alpha_engine, ml_crypto_pred, mutation_lab, battleground]
direction = SHORT (all CRYPTO shorts: WR=30.3%, net negative)
asset_class = FOREX (PF=0.86, confirmed sub-floor)
confidence 0.80-0.90 range (actual danger zone: PF=0.96 in OOS)
strategy = Bollinger Band Squeeze Breakout (WR=42.1%, below floor)
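
Taken together, the three rule sets above amount to a small classifier. The sketch below is one hedged reading of them, not the project's implementation: the function name and return labels are hypothetical, the AVOID rules are checked first so they override everything else, the 0.80 and 0.90 boundaries are treated as inside the danger zone, and the upper risk_reward limit is left out because the report gives it only as a range (2.0-2.5).

```python
TIER1_SOURCES = {"aggregated_picks", "kimi_signal_tracking"}
TIER2_SOURCES = {"signal_validation", "stocks_competition"}
BLOCKED_SOURCES = {"alpha_engine", "ml_crypto_pred", "mutation_lab", "battleground"}
BLOCKED_STRATEGIES = {"Bollinger Band Squeeze Breakout"}
ELITE_STRATEGIES = {"AuditEnsemble_LONG", "Multi-Timeframe Trend Alignment",
                    "VWAP Deviation Scalp"}

def classify(pick):
    """Return 'AVOID', 'TIER1', 'TIER2', or 'NONE' for a pick dict."""
    source = pick.get("source_system")
    conf = pick.get("confidence", 0.0)
    # AVOID rules take precedence over every other gate.
    if (
        source in BLOCKED_SOURCES
        or pick.get("direction") == "SHORT"
        or pick.get("asset_class") == "FOREX"
        or 0.80 <= conf <= 0.90
        or pick.get("strategy") in BLOCKED_STRATEGIES
    ):
        return "AVOID"
    # Gates shared by both tiers: LONG direction and confidence floor.
    if pick.get("direction") != "LONG" or conf < 0.65:
        return "NONE"
    if source in TIER1_SOURCES and pick.get("risk_reward", 0.0) >= 1.5:
        # The strategy allowlist applies only within aggregated_picks.
        if source == "aggregated_picks" and pick.get("strategy") not in ELITE_STRATEGIES:
            return "NONE"
        return "TIER1"
    if source in TIER2_SOURCES:
        return "TIER2"
    return "NONE"

# Hypothetical pick for illustration.
print(classify({"source_system": "kimi_signal_tracking", "direction": "LONG",
                "confidence": 0.70, "risk_reward": 1.8}))  # TIER1
```

A caller would then map TIER1 to the 0.5-0.75% allocation and TIER2 to 0.25%, and skip anything labelled AVOID or NONE.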

9. Action Items

Priority | Task | File
P0 | Add trust_score enrichment to active_picks.json (HC filter always returns 0 without it) | alpha_engine/data/active_picks.json + enrichment pipeline
P0 | Wire kimi_signal_tracking + aggregated_picks to emit picks to active_picks.json | alpha_engine/outcome_resolver.py
P1 | Add ELITE_STRATEGIES dict to weekly_filter_picks.py (AuditEnsemble_LONG WR=94.2%, MTF WR=90.8%, VWAP WR=97.1%) | tools/weekly_filter_picks.py
P1 | Fix HC Gate 8: block confidence 0.80-0.90 (not >0.90), correcting the misspecification | audit_dashboard/hc_filter.js
P1 | Block CRYPTO SHORT direction in active filter (WR=30.3%, losing after costs) | audit_dashboard/hc_filter.js or tools/weekly_filter_picks.py
P2 | Wire Kimi's backtest.py to our closed picks CSV (real PBO, DSR, WFE validation) | edge_engine/backtest.py (Kimi's file, keep)
P2 | Add strategy-level filter to HC filter (AuditEnsemble_LONG, MTF, VWAP) | audit_dashboard/hc_filter.js
P2 | Fix COMMODITY pipeline (COT timing leakage: n=0 in OOS prevents validation) | reports/cot_timing_leakage_audit_2026-05-13.md
OPERATOR | Rotate ejaguiar1_backtests DB password (stocks123 exposed in git history PR #1086) | GitHub Secrets: DB_PASS_BACKTESTS
Summary: The real edge lives in 2 systems (kimi_signal_tracking + aggregated_picks), 3 strategies within aggregated_picks (AuditEnsemble_LONG, Multi-Timeframe Trend Alignment, VWAP Deviation Scalp), LONG direction only, confidence 0.65-0.80. Source system selection is the dominant filter: it adds +26pp WR vs. no filter. MiniMax was right after self-correction. Kimi's performance claims were synthetic. Our OOS data is the only authoritative source.