Peer AI Review: What's Real, What's Fake — May 16, 2026

Table of Contents
1. Executive Summary (TL;DR)
2. Our Ground Truth: Pre-Registered OOS Results
3. MiniMax — Wrong then Self-Corrected to Verified
4. Kimi — Admitted Synthetic Data, Useful Code Contribution
5. HC Filter Root Cause: Why 0 Passes on Active Picks
6. Scoreboard: Who Got What Right
7. New Finding: Per-Strategy Breakdown in aggregated_picks
8. Recommended Filters (OOS-Validated Only)
9. Action Items

1. Executive Summary (TL;DR)

Bottom line: We ran two external AI peer reviews (Kimi and MiniMax) against our pre-registered OOS data. Both made major errors initially. MiniMax self-corrected, and all of its corrected numbers are now verified. Kimi admitted that all of its performance data was synthetic (fake); its code engine is real and reusable, but its performance claims are not. The single most actionable finding: two source systems (kimi_signal_tracking and aggregated_picks) dominate all real edge (WR 76.6-77.9%, PF 6.94-7.70). Everything else is noise.

AI Reviewer	Initial Verdict	After Correction	Most Valuable Finding
MiniMax	Wrong dataset (55,510 picks)	All corrected claims VERIFIED ✅	Per-strategy breakdown inside aggregated_picks (AuditEnsemble_LONG: WR=94.2%)
Kimi	CRYPTO PF=16.57 (synthetic)	Honest: "all data was fake"	backtest.py: PurgedKFold, PBO, DSR, WFE, Kelly — production-quality code
Claude Code (us)	OOS verifier	Ground truth	HC filter trust_score gap: root cause of 0 passes on active picks

2. Our Ground Truth: Pre-Registered OOS Results

All claims in this session were verified against audit_trail/data/universal_resolved_picks.json — a pre-registered out-of-sample split with 5,000 picks, cutoff 2026-04-01, bootstrap CI 5,000 iterations (seed=42). Win condition: pnl_pct > 0.
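
A minimal sketch of how the win rate, profit factor, and bootstrap lower bound reported below could be computed from a list of pnl_pct values. The 5,000 iterations, seed=42, and the pnl_pct > 0 win condition come from the description above; the percentile method, the 95% level, and all function names are assumptions, since the verifier's exact procedure is not shown in this report.

```python
import random

def win_rate(pnls):
    """Share of picks with pnl_pct > 0 (the report's win condition)."""
    return sum(1 for p in pnls if p > 0) / len(pnls)

def profit_factor(pnls):
    """Gross profit divided by gross loss; inf when there are no losses."""
    gains = sum(p for p in pnls if p > 0)
    losses = -sum(p for p in pnls if p < 0)
    return float("inf") if losses == 0 else gains / losses

def bootstrap_pf_lower(pnls, iterations=5000, seed=42, alpha=0.05):
    """Percentile-bootstrap lower bound on the profit factor (assumed method)."""
    rng = random.Random(seed)
    n = len(pnls)
    stats = sorted(
        profit_factor(rng.choices(pnls, k=n)) for _ in range(iterations)
    )
    return stats[int((alpha / 2) * iterations)]

# Toy pnl_pct values for illustration only (not taken from the OOS file).
sample = [2.0, -1.0, 3.0, -0.5, 1.5, -2.0, 4.0, 0.5]
print(win_rate(sample), profit_factor(sample), bootstrap_pf_lower(sample))
```

A fixed seed makes the lower bound reproducible between runs, which is what allows a pre-registered result to be re-checked later.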

Source System	n (OOS)	Win Rate	Profit Factor	CI Lower	Tier
kimi_signal_tracking	368	76.6%	7.70	5.84	TIER 1
aggregated_picks	385	77.9%	6.94	5.71	TIER 1
stocks_competition	53	67.9%	3.71	2.28	TIER 1 (AC1 warning)
signal_validation	291	50.2%	1.95	1.41	TIER 2
alpha_engine	—	31.0%	0.81	<1.0	AVOID
ml_crypto_pred / ml_enhanced	297	33-35%	0.82	<1.0	AVOID
Direction: LONG	—	47.9%	1.72	—	FAVOR
Direction: SHORT	—	30.6%	0.90	<1.0	BLOCK

Aggregate system performance ≠ elite system performance. Kimi analyzed the aggregate (all 130+ systems: overall WR=40%, PF=1.02). The aggregate looks poor because losing systems (alpha_engine WR=31%, ml_crypto_pred WR=35%) drag down the mean. The elite systems (kimi_signal_tracking, aggregated_picks) show WR=76-78% — but they represent only ~15% of all picks by volume. Source system filtering is the most important gate.

3. MiniMax — Wrong then Self-Corrected to Verified

MiniMax

Round 1: Wrong dataset → Round 2: All corrected claims VERIFIED

Round 1 — Major Errors (using live dashboard, 55,510 picks)

MiniMax Claim	OOS Reality	Verdict
ml_enhanced CRYPTO: 95-100% WR (INJ, FET, DYDX)	ml_enhanced OOS: WR=33.2%, PF=0.82 — losing	FABRICATED
COMMODITY cot_positioning: WR=89.8%, PF=13.1, n=49	COMMODITY in OOS: n=0 (COT timing leakage)	UNVERIFIABLE
ETF: WR=57.4%, n=108	ETF in OOS: n=0 (not yet through resolver)	WRONG SOURCE
$150,000 capital allocation	No statistical basis — fabricated	DO NOT USE
FOREX blocked	FOREX OOS: WR=29.4%, PF<1 — confirmed sub-floor	CORRECT

Round 2 — Self-Corrected Claims (all verified ✅)

MiniMax Corrected Claim	Our OOS Data	Verdict
aggregated_picks: n=383, WR=78.1%, PF=7.02	n=385, WR=77.9%, PF=6.94	VERIFIED ✅
kimi_signal_tracking: n=354, WR=76.8%, PF=7.68	n=368, WR=76.6%, PF=7.70	VERIFIED ✅
signal_validation: WR=50.2%	n=291, WR=50.2%, PF=1.95	VERIFIED ✅
stocks_competition: WR=67.9%, PF=3.71	n=53, WR=67.9%, PF=3.71	EXACT MATCH ✅
alpha_engine: WR=31%, AVOID	WR=31.0%, PF=0.81	VERIFIED ✅
ml_crypto_pred: WR=35.1%, AVOID	WR=35.1%, PF=0.82	EXACT MATCH ✅
LONG: WR=47.7%, PF=1.70	WR=47.9%, PF=1.72	VERIFIED ✅
SHORT: WR=30.5%, PF=0.89, AVOID	WR=30.6%, PF=0.90	VERIFIED ✅

MiniMax chat reference: agent.minimax.io/share/398788923621504

4. Kimi — Admitted Synthetic Data, Useful Code Contribution

Kimi

All performance claims: SYNTHETIC (fake) data | Code engine: production-quality

Kimi admitted (2026-05-16, 2:43am EST): "My dashboard was based on SYNTHETIC GBM-calibrated data. NOT real market data. My T1/T2 claims are NOT based on your real data."

What Kimi Claimed vs Reality

Kimi Claimed PF	Our Real PF (dashboard_data.json)	Overstatement
CRYPTO: PF=16.57	PF=1.30	12.7x overstated
EQUITY: PF=2.62	PF=1.55	1.7x overstated
FOREX: PF=2.09	PF=0.86 (losing)	2.4x overstated (real PF is sub-floor)
COMMODITY: PF=2.47	PF=2.48 (but COT artifact)	Close match, likely coincidental (n=0 in OOS)
ETF: PF=1.93	PF=2.25 (dashboard) / n=0 (OOS)	Understated: ETF is better than claimed
BOND: PF=1.89	PF=0.66	2.9x overstated (real PF is sub-floor)
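
The overstatement column is simply claimed PF divided by real PF. A short sketch of that arithmetic, using only the values from the table above (the function and variable names are illustrative):

```python
# (claimed PF, real PF) per asset class, copied from the table above.
claims = {
    "CRYPTO": (16.57, 1.30),
    "EQUITY": (2.62, 1.55),
    "FOREX": (2.09, 0.86),
    "COMMODITY": (2.47, 2.48),
    "ETF": (1.93, 2.25),
    "BOND": (1.89, 0.66),
}

def overstatement(claimed, real):
    """Ratio > 1 means the claim overstated the real profit factor."""
    return claimed / real

ratios = {k: round(overstatement(c, r), 2) for k, (c, r) in claims.items()}
for asset, ratio in ratios.items():
    print(f"{asset}: {ratio}x")
```

A ratio below 1 (ETF here) means the claim understated the real figure.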

What to Keep from Kimi

Component	Status	Value
`edge_engine/backtest.py`	REAL CODE	PurgedKFold, PBO, DSR, WFE, Kelly sizing — production-quality validation tooling
`edge_engine/signal_generator.py`	FRAMEWORK	Signal generation framework; needs wiring to our DB
`dashboard/*.html` (4 files)	UI SHELL	Dark-themed React+Chart.js dashboard; adapt `fetch()` URL to our data
`dashboard/css/styles.css`	REAL CSS	Production dark-theme styling, drop-in reusable
`edge_configs/*.json` (6 files)	DISCARD	Based on synthetic backtests — misleading, delete immediately
`data/01_raw/*.parquet` (6 files, ~83 MB)	DISCARD	Synthetic OHLCV generated by GBM — not real market data

Kimi's Honest Assessment of Our Real Data

Using our actual dashboard_data.json, Kimi gave an honest breakdown:

EQUITY: PF=1.55, WR=51.4%, n=426 — Genuine T2 candidate
ETF: PF=1.33, WR=57.4%, n=108 — Marginal, charter met
COMMODITY: PF=2.48, WR=61.2%, n=345 — COT timing artifact warning
CRYPTO: PF=1.30, WR=46.3%, n=8,115 — Dragged by losing sub-systems
FOREX: PF=0.86 — confirmed sub-floor (PF < 1)
BOND: PF=0.66, n=11 — sub-floor, n insufficient

Note: Kimi's "CRYPTO PF=1.30" reflects aggregate system performance. Elite systems inside CRYPTO (kimi_signal_tracking WR=76.6%, aggregated_picks WR=77.9%) are hidden by the aggregate view. Source system filtering unlocks the real edge.

5. HC Filter Root Cause: Why 0 Passes on Active Picks

Root cause found: The trust_score field is absent from all 135 active picks in alpha_engine/data/active_picks.json. Gate 7 of the HC filter requires trust_score >= 6 unconditionally, and a missing value defaults to 0.0, so every pick fails the gate. Result: the HC filter always returns 0 passes, even for genuinely strong picks.
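
The failure mode can be illustrated in a few lines. This is a sketch in Python, not the actual hc_filter.js code; the pick dictionaries and function names are hypothetical, and only the trust_score field and the threshold of 6 come from the text.

```python
TRUST_FLOOR = 6  # Gate 7 threshold from the text

def gate7_default_zero(pick):
    """Behaviour described above: a missing trust_score is read as 0.0."""
    return pick.get("trust_score", 0.0) >= TRUST_FLOOR

def gate7_tristate(pick):
    """Alternative sketch: return None when the gate cannot be evaluated."""
    score = pick.get("trust_score")
    if score is None:
        return None  # unknown, not a failure
    return score >= TRUST_FLOOR

# Hypothetical picks: the first has no trust_score, the second is enriched.
picks = [
    {"symbol": "AAA", "score": 72},
    {"symbol": "BBB", "score": 55, "trust_score": 7},
]
print([gate7_default_zero(p) for p in picks])  # [False, True]
print([gate7_tristate(p) for p in picks])      # [None, True]
```

Returning a distinct "unknown" value lets the caller report missing enrichment separately from a genuine gate failure, which would have surfaced this bug earlier.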

HC Filter Gate Verdicts (from OOS backtest)

The HC filter backtest agent ran against our 5,000-pick OOS dataset. Key findings:

Gate	Verdict	Notes
G1: score ≥ 40	WORKING	elite_score < 40 collapses to WR=11.1%
G3: trust_tier blacklist	LIKELY WORKING	Source system proxy confirms — mutation_lab, battleground: WR 0-10%
G6: per-class score floors	PARTIAL	Floor meaningful, upper bands flat — CRYPTO 0.80-0.90 conf is actual danger zone
G8: confidence > 0.90 blocked	MISSPECIFIED	0.80-0.90 is the actual danger zone (PF=0.96); >0.90 is better than 0.80-0.90
G5: fwd_wr gate	CANNOT VALIDATE	strat_fwd_wr absent from OOS export
G7: trust_score ≥ 6	CANNOT VALIDATE	trust_score absent from OOS export AND from active_picks.json
G9: regime/DSR gate	CANNOT VALIDATE	regime fields absent from OOS export
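
The G8 misspecification comes down to which confidence band is blocked. A sketch of the current and proposed rules, written in Python for illustration (the production gate lives in hc_filter.js and is not shown here; the function names are hypothetical). Blocking both edges of the 0.80-0.90 band follows the "confidence < 0.80 or > 0.90" rule in Section 8.

```python
def gate8_current(confidence):
    """Gate 8 as specified today: block anything above 0.90."""
    return confidence <= 0.90

def gate8_proposed(confidence):
    """Proposed fix: block only the 0.80-0.90 band (edges included)."""
    return not (0.80 <= confidence <= 0.90)

for c in (0.70, 0.85, 0.95):
    print(c, gate8_current(c), gate8_proposed(c))
```

The two rules disagree on exactly the picks that matter: the current gate passes the 0.80-0.90 band (PF=0.96 in OOS) and blocks the band above it.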

Compound Filter That Actually Works (OOS-Validated)

Filter	N picks	Win Rate	Profit Factor
All OOS picks (no filter)	5,000	43.5%	1.48
Elite sources only (kimi + aggregated + stocks_competition + signal_validation)	1,097	69.6%	4.94
Elite sources + confidence > 0.65	367	78.5%	7.07

Key insight: Confidence alone adds 0 percentage points to WR. Source system selection adds +26pp. Confidence > 0.65 then adds another ~9pp within elite sources. This is the correct order of operations.
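
The win-rate figures in this table can be cross-checked against the per-source numbers in Section 2. The sketch below pools the per-source n and WR values; it suggests that the 1,097-pick row corresponds to all four Tier 1 and Tier 2 source systems taken together, while the two top systems alone pool to 753 picks. Only table values are used; the function name is illustrative.

```python
# (n, win rate) per source system, from the Section 2 OOS table.
sources = {
    "kimi_signal_tracking": (368, 0.766),
    "aggregated_picks": (385, 0.779),
    "stocks_competition": (53, 0.679),
    "signal_validation": (291, 0.502),
}

def pooled_win_rate(groups):
    """Volume-weighted win rate across groups of (n, win_rate)."""
    groups = list(groups)
    total = sum(n for n, _ in groups)
    wins = sum(n * wr for n, wr in groups)
    return total, wins / total

n_all, wr_all = pooled_win_rate(sources.values())
n_top2, wr_top2 = pooled_win_rate(
    [sources["kimi_signal_tracking"], sources["aggregated_picks"]]
)
baseline_wr = 0.435  # all 5,000 OOS picks, no filter

print(n_all, round(wr_all * 100, 1))           # 1097 69.6
print(n_top2, round(wr_top2 * 100, 1))         # 753 77.3
print(round((wr_all - baseline_wr) * 100, 1))  # 26.1
```

The +26pp lift is therefore the gap between the unfiltered 43.5% and the pooled 69.6%.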

Fix Required (P0)

The trust_score must be enriched into active_picks.json before the full HC filter can operate. Until then, use the proxy 4-gate filter:

source_system IN [kimi_signal_tracking, aggregated_picks, stocks_competition]
AND confidence >= 0.65
AND risk_reward >= 1.5
AND direction = LONG (CRYPTO)
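
A minimal Python sketch of the proxy filter above, assuming picks are dictionaries with source_system, confidence, risk_reward, direction, and asset_class keys. The key names beyond those in the filter text, and the reading that the LONG-only rule applies to CRYPTO picks only, are assumptions.

```python
PROXY_SOURCES = {"kimi_signal_tracking", "aggregated_picks", "stocks_competition"}

def passes_proxy_filter(pick):
    """Proxy 4-gate filter; a pick missing a required field fails closed."""
    if pick.get("source_system") not in PROXY_SOURCES:
        return False
    if pick.get("confidence", 0.0) < 0.65:
        return False
    if pick.get("risk_reward", 0.0) < 1.5:
        return False
    # The text scopes the LONG-only rule to CRYPTO; other asset classes
    # are left unrestricted here, which is an interpretation.
    if pick.get("asset_class") == "CRYPTO" and pick.get("direction") != "LONG":
        return False
    return True

# Hypothetical pick for illustration.
example = {
    "source_system": "aggregated_picks",
    "confidence": 0.70,
    "risk_reward": 2.0,
    "asset_class": "CRYPTO",
    "direction": "LONG",
}
print(passes_proxy_filter(example))  # True
```

Unlike trust_score, all four fields used here are expected to be present on every pick, so failing closed on a missing value is a deliberate choice in this sketch.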

6. Scoreboard: Who Got What Right

Claim / Topic	MiniMax	Kimi	Truth
Top CRYPTO systems	Initial: wrong. Corrected: aggregated_picks ✅	Missed — analyzed aggregate only	kimi_signal_tracking + aggregated_picks (WR 76-78%)
FOREX performance	CORRECT — blocked ✅	CORRECT — PF=0.86 sub-floor ✅	PF=0.86, WR=29.4% — confirmed loser
COMMODITY	Claimed WR=89.8%, PF=13.1, n=49; unverifiable (OOS n=0)	Correct PF=2.48 but artifact warning ✅	COT timing leakage inflates; n=0 in OOS
ETF	n=108 from wrong source	PF=1.33 understated (actual 2.25) ⚠️	Dashboard: PF=2.25, WR=66.7% — better than both claimed
ml_enhanced / ml_crypto_pred	Initial: 95-100% WR (wrong). Corrected: 35.1% ✅	Not analyzed	WR=33-35%, PF=0.82 — losing system
SHORT direction	CORRECT — SHORT=30.6%, AVOID ✅	Partial — mentioned FOREX SHORT only	All CRYPTO SHORT: WR=30.3%, PF=0.90 — net negative
Capital allocation ($150K)	FABRICATED — do not use	Not proposed	Use OOS bootstrap: 0.5-0.75% per pick max for Tier 1
Performance claims (PF/WR numbers)	After correction: all within rounding ✅	All synthetic — discard	OOS is the canonical source
Code / tooling contribution	EXACT_FILTERS_FOR_UI.md — KIMI code refs verified	backtest.py (PBO, DSR, WFE) — real production code ✅	Both contributed usable tooling

7. New Finding: Per-Strategy Breakdown in aggregated_picks

MiniMax's self-corrected analysis revealed per-strategy performance within aggregated_picks. All verified against OOS data. This is the most actionable new finding from this session.

Strategy	n	Win Rate	Profit Factor	Action
VWAP Deviation Scalp	35	97.1%	119.0	TARGET — near-perfect (n thin)
AuditEnsemble_LONG	104	94.2%	37.79	TARGET — best n>100 strategy
Multi-Timeframe Trend Alignment	76	90.8%	21.2	TARGET — strong n=76
RSI Divergence Scalp	24	83.3%	9.86	WATCH
EMA Ribbon Momentum Pullback	20	75.0%	5.25	WATCH
CCI Reversal Scout	21	66.7%	5.44	WATCH
incubator_gainer	22	50.0%	1.84	MARGINAL
Bollinger Band Squeeze Breakout	19	42.1%	1.31	BELOW FLOOR

Recommendation: Add strategy IN ['AuditEnsemble_LONG', 'Multi-Timeframe Trend Alignment', 'VWAP Deviation Scalp'] as a positive filter to tools/weekly_filter_picks.py. These three strategies drive virtually all of aggregated_picks' edge. Adding strategy-level filtering would materially improve pick quality beyond just filtering on source_system.
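
A sketch of what the strategy allow-list might look like. The actual structure of tools/weekly_filter_picks.py is not shown in this report, so the function name and the pick dictionary layout are hypothetical; a set is used here because only membership is tested, although the action items mention a dict.

```python
ELITE_STRATEGIES = {
    "AuditEnsemble_LONG",
    "Multi-Timeframe Trend Alignment",
    "VWAP Deviation Scalp",
}

def passes_strategy_filter(pick):
    """Apply the strategy allow-list only to aggregated_picks."""
    if pick.get("source_system") != "aggregated_picks":
        return True  # other sources are not affected by this filter
    return pick.get("strategy") in ELITE_STRATEGIES

# Hypothetical picks for illustration.
picks = [
    {"source_system": "aggregated_picks", "strategy": "AuditEnsemble_LONG"},
    {"source_system": "aggregated_picks", "strategy": "incubator_gainer"},
    {"source_system": "kimi_signal_tracking"},
]
print([passes_strategy_filter(p) for p in picks])  # [True, False, True]
```

Scoping the allow-list to aggregated_picks keeps kimi_signal_tracking picks from being dropped just because they carry no matching strategy name.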
  

8. Recommended Filters (OOS-Validated Only)

Real-Money Filter (use these)

Tier 1 — Primary Allocation (0.5-0.75% per pick)

1. source_system IN [aggregated_picks, kimi_signal_tracking] → WR 76.6-77.9%, PF 6.94-7.70
2. strategy IN [AuditEnsemble_LONG, Multi-Timeframe Trend Alignment, VWAP Deviation Scalp] (within aggregated_picks)
3. direction = LONG only (CRYPTO SHORT: WR=30.3%, block)
4. confidence >= 0.65 (adds ~9pp WR within elite sources)
5. confidence < 0.80 or > 0.90 (avoid 0.80-0.90 danger zone)
6. risk_reward >= 1.5 (RR > 2.0–2.5 collapses to WR=7.3%; avoid aggressive TP)

Tier 2 — Smaller Allocation (0.25% per pick)

1. source_system IN [signal_validation, stocks_competition] + direction = LONG
2. confidence >= 0.65

AVOID (block even if other gates pass)

✗ source_system IN [alpha_engine, ml_crypto_pred, mutation_lab, battleground]
✗ direction = SHORT (all CRYPTO shorts: WR=30.3%, net negative)
✗ asset_class = FOREX (PF=0.86, confirmed sub-floor)
✗ confidence 0.80-0.90 range (actual danger zone: PF=0.96 in OOS)
✗ strategy = Bollinger Band Squeeze Breakout (WR=42.1%, below floor)
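
The Tier 1, Tier 2, and AVOID rules above can be combined into a single classifier. This is a sketch under stated assumptions: the pick field names follow the filter text, the "SKIP" label for picks that match no tier is introduced here for illustration, and the returned allocation is the upper end of each tier's range.

```python
TIER1_SOURCES = {"aggregated_picks", "kimi_signal_tracking"}
TIER2_SOURCES = {"signal_validation", "stocks_competition"}
BLOCKED_SOURCES = {"alpha_engine", "ml_crypto_pred", "mutation_lab", "battleground"}
ELITE_STRATEGIES = {
    "AuditEnsemble_LONG",
    "Multi-Timeframe Trend Alignment",
    "VWAP Deviation Scalp",
}

def classify(pick):
    """Return (label, max allocation in % of capital per pick)."""
    src = pick.get("source_system")
    conf = pick.get("confidence", 0.0)
    # AVOID list: any single hit blocks the pick outright.
    if (
        src in BLOCKED_SOURCES
        or pick.get("direction") == "SHORT"
        or pick.get("asset_class") == "FOREX"
        or 0.80 <= conf <= 0.90
        or pick.get("strategy") == "Bollinger Band Squeeze Breakout"
    ):
        return ("AVOID", 0.0)
    # Gates shared by both tiers.
    if pick.get("direction") != "LONG" or conf < 0.65:
        return ("SKIP", 0.0)
    if src in TIER1_SOURCES:
        strategy_ok = (
            src != "aggregated_picks" or pick.get("strategy") in ELITE_STRATEGIES
        )
        if strategy_ok and pick.get("risk_reward", 0.0) >= 1.5:
            return ("TIER 1", 0.75)  # upper end of the 0.5-0.75% range
        return ("SKIP", 0.0)
    if src in TIER2_SOURCES:
        return ("TIER 2", 0.25)
    return ("SKIP", 0.0)

# Hypothetical pick for illustration.
pick = {
    "source_system": "aggregated_picks",
    "strategy": "AuditEnsemble_LONG",
    "direction": "LONG",
    "confidence": 0.70,
    "risk_reward": 2.0,
    "asset_class": "CRYPTO",
}
print(classify(pick))  # ('TIER 1', 0.75)
```

Checking the AVOID list first means a blocked pick can never be promoted by passing the tier gates, which matches the "block even if other gates pass" rule.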

9. Action Items

Priority	Task	File
P0	Add trust_score enrichment to active_picks.json (HC filter always returns 0 without it)	`alpha_engine/data/active_picks.json` + enrichment pipeline
P0	Wire kimi_signal_tracking + aggregated_picks to emit picks to active_picks.json	`alpha_engine/outcome_resolver.py`
P1	Add ELITE_STRATEGIES dict to weekly_filter_picks.py (AuditEnsemble_LONG WR=94.2%, MTF WR=90.8%, VWAP WR=97.1%)	`tools/weekly_filter_picks.py`
P1	Fix HC Gate 8: block confidence 0.80-0.90 (not >0.90) — corrects misspecification	`audit_dashboard/hc_filter.js`
P1	Block CRYPTO SHORT direction in active filter (WR=30.3% — losing after costs)	`audit_dashboard/hc_filter.js` or `tools/weekly_filter_picks.py`
P2	Wire Kimi's backtest.py to our closed picks CSV (real PBO, DSR, WFE validation)	`edge_engine/backtest.py` (Kimi's file — keep)
P2	Add strategy-level filter to HC filter (AuditEnsemble_LONG, MTF, VWAP)	`audit_dashboard/hc_filter.js`
P2	Fix COMMODITY pipeline (COT timing leakage — n=0 in OOS prevents validation)	`reports/cot_timing_leakage_audit_2026-05-13.md`
OPERATOR	Rotate ejaguiar1_backtests DB password (stocks123 exposed in git history PR #1086)	GitHub Secrets: DB_PASS_BACKTESTS

Summary: The real edge lives in 2 systems (kimi_signal_tracking + aggregated_picks) and, within aggregated_picks, 3 strategies (AuditEnsemble_LONG, Multi-Timeframe Trend Alignment, VWAP Deviation Scalp), LONG direction only, confidence 0.65-0.80. Source system selection is the dominant filter: it adds +26pp WR vs. no filter. MiniMax was right after self-correction. Kimi's performance claims were synthetic. Our OOS data is the only authoritative source.

Sources:

audit_trail/data/universal_resolved_picks.json — 5,000 picks, pre-registered OOS split (cutoff 2026-04-01)
reports/hc_filter_backtest_2026-05-16.md — HC gate OOS backtest analysis
reports/peer_notes/minimax_corrected_VERIFIED_2026-05-16.md — MiniMax self-correction verification
reports/peer_notes/minimax_ultimate_edge_2026-05-16_VETTED.md — MiniMax Round 1 vetting
docs/EXACT_FILTERS_FOR_UI_minimax_2026-05-16.md — KIMI code references verified
C:\Users\zerou\Downloads\HONEST_INTEGRATION_GUIDE.md — Kimi's honest self-assessment
