1. Executive Summary (TL;DR)
Bottom line: We ran two external AI peer reviews (Kimi and MiniMax) against our pre-registered OOS data. Both made major errors initially. MiniMax self-corrected, and all of its corrected numbers are now verified. Kimi admitted that all of its performance data was synthetic (fake): its code engine is real and reusable, its performance claims are not. The single most actionable finding: two source systems (kimi_signal_tracking and aggregated_picks) account for nearly all of the real edge (WR 76-78%, PF 6.9-7.7). The remaining systems are marginal at best (stocks_competition, signal_validation) or losing.
| AI Reviewer | Initial Verdict | After Correction | Most Valuable Finding |
| MiniMax | Wrong dataset (55,510 picks) | All corrected claims VERIFIED ✅ | Per-strategy breakdown inside aggregated_picks (AuditEnsemble_LONG: WR=94.2%) |
| Kimi | CRYPTO PF=16.57 (synthetic) | Honest: "all data was fake" | backtest.py: PurgedKFold, PBO, DSR, WFE, Kelly (production-quality code) |
| Claude Code (us) | OOS verifier | Ground truth | HC filter trust_score gap: root cause of 0 passes on active picks |
2. Our Ground Truth: Pre-Registered OOS Results
All claims in this session were verified against audit_trail/data/universal_resolved_picks.json — a pre-registered out-of-sample split with 5,000 picks, cutoff 2026-04-01, bootstrap CI 5,000 iterations (seed=42). Win condition: pnl_pct > 0.
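The metrics in the table below can be reproduced along these lines. This is a minimal sketch, not the verifier's actual code: it assumes the usual definition of profit factor (gross profit divided by gross loss) and a percentile bootstrap with a 95% interval, neither of which is spelled out in this report. The function names are hypothetical; only the win condition (pnl_pct > 0), the 5,000 iterations, and seed=42 come from the text above.

```python
import random

def win_rate(pnls):
    """Fraction of picks with pnl_pct > 0 (the win condition stated above)."""
    return sum(1 for p in pnls if p > 0) / len(pnls)

def profit_factor(pnls):
    """Gross profit divided by gross loss; infinite when there are no losses."""
    gains = sum(p for p in pnls if p > 0)
    losses = -sum(p for p in pnls if p < 0)
    return float("inf") if losses == 0 else gains / losses

def bootstrap_ci_lower(pnls, stat, iterations=5000, seed=42, alpha=0.05):
    """Percentile bootstrap (assumed method): resample with replacement,
    then take the alpha/2 quantile of the resampled statistic."""
    rng = random.Random(seed)
    n = len(pnls)
    samples = sorted(stat(rng.choices(pnls, k=n)) for _ in range(iterations))
    return samples[int((alpha / 2) * iterations)]
```

Calling bootstrap_ci_lower(pnls, profit_factor) on the pnl_pct values of one source system would give a "CI Lower" figure comparable to the column below, provided the assumptions above match the verifier.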
| Source System | n (OOS) | Win Rate | Profit Factor | CI Lower | Tier |
| kimi_signal_tracking | 368 | 76.6% | 7.70 | 5.84 | TIER 1 |
| aggregated_picks | 385 | 77.9% | 6.94 | 5.71 | TIER 1 |
| stocks_competition | 53 | 67.9% | 3.71 | 2.28 | TIER 1 (AC1 warning) |
| signal_validation | 291 | 50.2% | 1.95 | 1.41 | TIER 2 |
| alpha_engine | n/a | 31.0% | 0.81 | <1.0 | AVOID |
| ml_crypto_pred / ml_enhanced | 297 | 33-35% | 0.82 | <1.0 | AVOID |
| Direction: LONG | n/a | 47.9% | 1.72 | n/a | FAVOR |
| Direction: SHORT | n/a | 30.6% | 0.90 | <1.0 | BLOCK |
Aggregate system performance ≠ elite system performance. Kimi analyzed the aggregate (all 130+ systems: overall WR=40%, PF=1.02). The aggregate looks poor because losing systems (alpha_engine WR=31%, ml_crypto_pred WR=35%) drag down the mean. The elite systems (kimi_signal_tracking, aggregated_picks) show WR=76-78% — but they represent only ~15% of all picks by volume. Source system filtering is the most important gate.
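The masking effect described above is easy to see with a per-source breakdown. The sketch below uses made-up toy picks (not rows from the OOS file) and a hypothetical helper name; the field names source_system and pnl_pct are the ones used in this report.

```python
from collections import defaultdict

def win_rate_by_source(picks):
    """Return {source_system: (n, win_rate)} using the pnl_pct > 0 win condition."""
    buckets = defaultdict(list)
    for pick in picks:
        buckets[pick["source_system"]].append(pick["pnl_pct"] > 0)
    return {src: (len(wins), sum(wins) / len(wins)) for src, wins in buckets.items()}

# Toy data, invented for illustration: a small strong source and a large weak one.
picks = (
    [{"source_system": "elite", "pnl_pct": 1.0}] * 3
    + [{"source_system": "elite", "pnl_pct": -1.0}] * 1
    + [{"source_system": "noise", "pnl_pct": 1.0}] * 6
    + [{"source_system": "noise", "pnl_pct": -1.0}] * 14
)

overall = sum(p["pnl_pct"] > 0 for p in picks) / len(picks)
by_source = win_rate_by_source(picks)
```

Here the aggregate win rate is 37.5% even though the small "elite" source wins 75% of the time, which is the same pattern as the 40% aggregate versus the 76-78% elite systems.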
3. MiniMax — Wrong then Self-Corrected to Verified
Round 1: Wrong dataset → Round 2: All corrected claims VERIFIED
Round 1 — Major Errors (using live dashboard, 55,510 picks)
| MiniMax Claim | OOS Reality | Verdict |
| ml_enhanced CRYPTO: 95-100% WR (INJ, FET, DYDX) | ml_enhanced OOS: WR=33.2%, PF=0.82 — losing | FABRICATED |
| COMMODITY cot_positioning: WR=89.8%, PF=13.1, n=49 | COMMODITY in OOS: n=0 (COT timing leakage) | UNVERIFIABLE |
| ETF: WR=57.4%, n=108 | ETF in OOS: n=0 (not yet through resolver) | WRONG SOURCE |
| $150,000 capital allocation | No statistical basis — fabricated | DO NOT USE |
| FOREX blocked | FOREX OOS: WR=29.4%, PF<1 — confirmed sub-floor | CORRECT |
Round 2 — Self-Corrected Claims (all verified ✅)
| MiniMax Corrected Claim | Our OOS Data | Verdict |
| aggregated_picks: n=383, WR=78.1%, PF=7.02 | n=385, WR=77.9%, PF=6.94 | VERIFIED ✅ |
| kimi_signal_tracking: n=354, WR=76.8%, PF=7.68 | n=368, WR=76.6%, PF=7.7 | VERIFIED ✅ |
| signal_validation: WR=50.2% | n=291, WR=50.2%, PF=1.95 | VERIFIED ✅ |
| stocks_competition: WR=67.9%, PF=3.71 | n=53, WR=67.9%, PF=3.71 | EXACT MATCH ✅ |
| alpha_engine: WR=31%, AVOID | WR=31.0%, PF=0.81 | VERIFIED ✅ |
| ml_crypto_pred: WR=35.1%, AVOID | WR=35.1%, PF=0.82 | EXACT MATCH ✅ |
| LONG: WR=47.7%, PF=1.70 | WR=47.9%, PF=1.72 | VERIFIED ✅ |
| SHORT: WR=30.5%, PF=0.89, AVOID | WR=30.6%, PF=0.90 | VERIFIED ✅ |
MiniMax chat reference: agent.minimax.io/share/398788923621504
4. Kimi — Admitted Synthetic Data, Useful Code Contribution
All performance claims: SYNTHETIC (fake) data | Code engine: production-quality
Kimi admitted (2026-05-16, 2:43am EST): "My dashboard was based on SYNTHETIC GBM-calibrated data. NOT real market data. My T1/T2 claims are NOT based on your real data."
What Kimi Claimed vs Reality
| Kimi Claimed PF | Our Real PF (dashboard_data.json) | Overstatement |
| CRYPTO: PF=16.57 | PF=1.30 | 12.7x overstated |
| EQUITY: PF=2.62 | PF=1.55 | 1.7x overstated |
| FOREX: PF=2.09 | PF=0.86 (losing) | 2.4x overstated (actually sub-floor) |
| COMMODITY: PF=2.47 | PF=2.48 (but COT artifact) | Close match, likely coincidental (n=0 in OOS) |
| ETF: PF=1.93 | PF=2.25 (dashboard) / n=0 (OOS) | Understated; the real ETF figure is better |
| BOND: PF=1.89 | PF=0.66 | 2.9x overstated (actually sub-floor) |
What to Keep from Kimi
| Component | Status | Value |
| edge_engine/backtest.py | REAL CODE | PurgedKFold, PBO, DSR, WFE, Kelly sizing; production-quality validation tooling |
| edge_engine/signal_generator.py | FRAMEWORK | Signal generation framework; needs wiring to our DB |
| dashboard/*.html (4 files) | UI SHELL | Dark-themed React+Chart.js dashboard; adapt fetch() URL to our data |
| dashboard/css/styles.css | REAL CSS | Production dark-theme styling, drop-in reusable |
| edge_configs/*.json (6 files) | DISCARD | Based on synthetic backtests; misleading, delete immediately |
| data/01_raw/*.parquet (6 files, ~83 MB) | DISCARD | Synthetic OHLCV generated by GBM; not real market data |
Kimi's Honest Assessment of Our Real Data
Using our actual dashboard_data.json, Kimi gave an honest breakdown:
- EQUITY: PF=1.55, WR=51.4%, n=426 — Genuine T2 candidate
- ETF: PF=1.33, WR=57.4%, n=108 — Marginal, charter met
- COMMODITY: PF=2.48, WR=61.2%, n=345 — COT timing artifact warning
- CRYPTO: PF=1.30, WR=46.3%, n=8,115 — Dragged by losing sub-systems
- FOREX: PF=0.86 — confirmed sub-floor (PF < 1)
- BOND: PF=0.66, n=11 — sub-floor, n insufficient
Note: Kimi's "CRYPTO PF=1.30" reflects aggregate system performance. Elite systems inside CRYPTO (kimi_signal_tracking WR=76.6%, aggregated_picks WR=77.9%) are hidden by the aggregate view. Source system filtering unlocks the real edge.
5. HC Filter Root Cause: Why 0 Passes on Active Picks
Root cause found: The trust_score field is absent from all 135 active picks in alpha_engine/data/active_picks.json. Gate 7 of the HC filter requires trust_score >= 6 as an unconditional gate — defaulting to 0.0 when missing means every pick fails. Result: HC filter always returns 0 passes, even for genuinely strong picks.
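The failure mode can be illustrated in a few lines. The real gate lives in the HC filter, so this Python sketch only mirrors the behaviour described above; both function names are hypothetical, and the second variant is one possible way to surface the missing field rather than a description of the planned fix.

```python
def gate7_current(pick, floor=6.0):
    """Behaviour described above: a missing trust_score defaults to 0.0 and fails."""
    return pick.get("trust_score", 0.0) >= floor

def gate7_explicit(pick, floor=6.0):
    """Hypothetical variant: report a missing field separately from a genuine fail."""
    score = pick.get("trust_score")
    if score is None:
        return "missing"
    return "pass" if score >= floor else "fail"
```

With gate7_current, a pick that lacks trust_score is indistinguishable from one that scored 0, which is why all 135 active picks fail. Returning a distinct "missing" status would make the data gap visible instead of silently producing 0 passes.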
HC Filter Gate Verdicts (from OOS backtest)
The HC filter backtest agent ran against our 5,000-pick OOS dataset. Key findings:
| Gate | Verdict | Notes |
| G1: score ≥ 40 | WORKING | elite_score < 40 collapses to WR=11.1% |
| G3: trust_tier blacklist | LIKELY WORKING | Source system proxy confirms — mutation_lab, battleground: WR 0-10% |
| G6: per-class score floors | PARTIAL | Floor meaningful, upper bands flat — CRYPTO 0.80-0.90 conf is actual danger zone |
| G8: confidence > 0.90 blocked | MISSPECIFIED | 0.80-0.90 is the actual danger zone (PF=0.96); >0.90 is better than 0.80-0.90 |
| G5: fwd_wr gate | CANNOT VALIDATE | strat_fwd_wr absent from OOS export |
| G7: trust_score ≥ 6 | CANNOT VALIDATE | trust_score absent from OOS export AND from active_picks.json |
| G9: regime/DSR gate | CANNOT VALIDATE | regime fields absent from OOS export |
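The G8 misspecification comes down to which confidence band is blocked. A minimal Python sketch of the two rules follows; the function names are hypothetical, and treating both band edges as inclusive is an assumption (the table only gives the 0.80-0.90 range).

```python
def gate8_as_specified(confidence):
    """Current rule per the table: block only confidence > 0.90."""
    return confidence <= 0.90

def gate8_corrected(confidence, low=0.80, high=0.90):
    """Rule suggested by the OOS result: block the 0.80-0.90 band instead.
    Inclusive edges are an assumption."""
    return not (low <= confidence <= high)
```

A pick at confidence 0.85 passes the current rule but is blocked by the corrected one, while a pick at 0.95 does the opposite.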
Compound Filter That Actually Works (OOS-Validated)
| Filter | N picks | Win Rate | Profit Factor |
| All OOS picks (no filter) | 5,000 | 43.5% | 1.48 |
| Elite sources only (kimi + aggregated) | 1,097 | 69.6% | 4.94 |
| Elite sources + confidence > 0.65 | 367 | 78.5% | 7.07 |
Key insight: Confidence alone adds 0 percentage points to WR. Source system selection adds +26pp. Confidence > 0.65 then adds another ~9pp within elite sources. This is the correct order of operations.
Fix Required (P0)
The trust_score must be enriched into active_picks.json before the full HC filter can operate. Until then, use the proxy 4-gate filter:
source_system IN [kimi_signal_tracking, aggregated_picks, stocks_competition]
AND confidence >= 0.65
AND risk_reward >= 1.5
AND direction = LONG (CRYPTO)
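The proxy filter above could be expressed as a single predicate. This is a sketch under stated assumptions: the function and constant names are hypothetical, missing fields are treated as failing (fail closed), and the last gate is read as "LONG only for CRYPTO picks", which is one interpretation of the "(CRYPTO)" qualifier.

```python
PROXY_SOURCES = {"kimi_signal_tracking", "aggregated_picks", "stocks_competition"}

def passes_proxy_filter(pick):
    """Apply the four proxy gates in order; missing fields fail closed."""
    if pick.get("source_system") not in PROXY_SOURCES:
        return False
    if pick.get("confidence", 0.0) < 0.65:
        return False
    if pick.get("risk_reward", 0.0) < 1.5:
        return False
    # Assumption: the LONG-only rule is applied to CRYPTO picks only.
    if pick.get("asset_class") == "CRYPTO" and pick.get("direction") != "LONG":
        return False
    return True
```

Usage would be something like `[p for p in active_picks if passes_proxy_filter(p)]` until trust_score is available and the full HC filter can run.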
6. Scoreboard: Who Got What Right
| Claim / Topic | MiniMax | Kimi | Truth |
| Top CRYPTO systems | Initial: wrong. Corrected: aggregated_picks ✅ | Missed; analyzed aggregate only | kimi_signal_tracking + aggregated_picks (WR 76-78%) |
| FOREX performance | CORRECT: blocked ✅ | CORRECT: PF=0.86 sub-floor ✅ | PF=0.86, WR=29.4%; confirmed loser |
| COMMODITY | Fabricated n=345 (OOS n=0) | Correct PF=2.48 but artifact warning ✅ | COT timing leakage inflates; n=0 in OOS |
| ETF | n=108 from wrong source | PF=1.33 understated (actual 2.25) ⚠️ | Dashboard: PF=2.25, WR=66.7%; better than both claimed |
| ml_enhanced / ml_crypto_pred | Initial: 95-100% WR (wrong). Corrected: 35.1% ✅ | Not analyzed | WR=33-35%, PF=0.82; losing system |
| SHORT direction | CORRECT: SHORT=30.6%, AVOID ✅ | Partial; mentioned FOREX SHORT only | All CRYPTO SHORT: WR=30.3%, PF=0.90; net negative |
| Capital allocation ($150K) | FABRICATED; do not use | Not proposed | Use OOS bootstrap: 0.5-0.75% per pick max for Tier 1 |
| Performance claims (PF/WR numbers) | After correction: all within rounding ✅ | All synthetic; discard | OOS is the canonical source |
| Code / tooling contribution | EXACT_FILTERS_FOR_UI.md; KIMI code refs verified | backtest.py (PBO, DSR, WFE); real production code ✅ | Both contributed usable tooling |
7. New Finding: Per-Strategy Breakdown in aggregated_picks
MiniMax's self-corrected analysis revealed per-strategy performance within aggregated_picks. All verified against OOS data. This is the most actionable new finding from this session.
| Strategy | n | Win Rate | Profit Factor | Action |
| VWAP Deviation Scalp | 35 | 97.1% | 119.0 | TARGET: near-perfect (n thin) |
| AuditEnsemble_LONG | 104 | 94.2% | 37.79 | TARGET: best n>100 strategy |
| Multi-Timeframe Trend Alignment | 76 | 90.8% | 21.2 | TARGET: strong n=76 |
| RSI Divergence Scalp | 24 | 83.3% | 9.86 | WATCH |
| EMA Ribbon Momentum Pullback | 20 | 75.0% | 5.25 | WATCH |
| CCI Reversal Scout | 21 | 66.7% | 5.44 | WATCH |
| incubator_gainer | 22 | 50.0% | 1.84 | MARGINAL |
| Bollinger Band Squeeze Breakout | 19 | 42.1% | 1.31 | BELOW FLOOR |
Recommendation: Add strategy IN ['AuditEnsemble_LONG', 'Multi-Timeframe Trend Alignment', 'VWAP Deviation Scalp'] as a positive filter to tools/weekly_filter_picks.py. These three strategies drive virtually all of aggregated_picks' edge. Adding strategy-level filtering would materially improve pick quality beyond just filtering on source_system.
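One possible shape for that positive filter is sketched below. The dict name ELITE_STRATEGIES matches the action item in section 9, but its structure, the helper name, and the choice to let picks from other source systems pass through untouched are assumptions; the n and WR values are copied from the table above. How this would be wired into tools/weekly_filter_picks.py is not shown.

```python
# OOS stats copied from the per-strategy table (n, win rate as a fraction).
ELITE_STRATEGIES = {
    "AuditEnsemble_LONG": {"n": 104, "wr": 0.942},
    "Multi-Timeframe Trend Alignment": {"n": 76, "wr": 0.908},
    "VWAP Deviation Scalp": {"n": 35, "wr": 0.971},
}

def is_elite_strategy(pick):
    """Strategy gate for aggregated_picks; other sources pass through unchanged
    (assumption: the breakdown was only measured inside aggregated_picks)."""
    if pick.get("source_system") != "aggregated_picks":
        return True
    return pick.get("strategy") in ELITE_STRATEGIES
```

Keeping the sample sizes next to the names makes it easy to see that VWAP Deviation Scalp rests on only 35 picks when the list is reviewed later.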
8. Recommended Filters (OOS-Validated Only)
Real-Money Filter (use these)
Tier 1 — Primary Allocation (0.5-0.75% per pick)
1. source_system IN [aggregated_picks, kimi_signal_tracking] → WR 77-78%, PF 6.9-7.7
2. strategy IN [AuditEnsemble_LONG, Multi-Timeframe Trend Alignment, VWAP Deviation Scalp] (within aggregated_picks)
3. direction = LONG only (CRYPTO SHORT: WR=30.3%, block)
4. confidence >= 0.65 (adds +9pp WR within elite sources)
5. confidence < 0.80 or > 0.90 (avoid 0.80-0.90 danger zone)
6. risk_reward >= 1.5 (RR above 2.0-2.5 collapses to WR=7.3%; avoid aggressive TP)
Tier 2 — Smaller Allocation (0.25% per pick)
1. source_system IN [signal_validation, stocks_competition] + direction = LONG
2. confidence >= 0.65
AVOID (block even if other gates pass)
✗ source_system IN [alpha_engine, ml_crypto_pred, mutation_lab, battleground]
✗ direction = SHORT (all CRYPTO shorts: WR=30.3%, net negative)
✗ asset_class = FOREX (PF=0.86, confirmed sub-floor)
✗ confidence 0.80-0.90 range (actual danger zone: PF=0.96 in OOS)
✗ strategy = Bollinger Band Squeeze Breakout (WR=42.1%, below floor)
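The tier rules and the AVOID list can be combined into one sizing function. The sketch below is an illustration under assumptions, not production code: the function and constant names are hypothetical, missing fields fail closed, the 0.80-0.90 band is treated as inclusive, and Tier 1 returns the upper end of the stated 0.5-0.75% range.

```python
TIER1_SOURCES = {"aggregated_picks", "kimi_signal_tracking"}
TIER2_SOURCES = {"signal_validation", "stocks_competition"}
BLOCKED_SOURCES = {"alpha_engine", "ml_crypto_pred", "mutation_lab", "battleground"}
TIER1_STRATEGIES = {
    "AuditEnsemble_LONG",
    "Multi-Timeframe Trend Alignment",
    "VWAP Deviation Scalp",
}

def max_allocation(pick):
    """Return the largest fraction of capital allowed for a pick (0.0 means blocked)."""
    src = pick.get("source_system")
    conf = pick.get("confidence", 0.0)  # missing confidence fails closed

    # AVOID list plus the shared confidence floor: any hit blocks the pick.
    if (
        src in BLOCKED_SOURCES
        or pick.get("direction") != "LONG"
        or pick.get("asset_class") == "FOREX"
        or pick.get("strategy") == "Bollinger Band Squeeze Breakout"
        or conf < 0.65
        or 0.80 <= conf <= 0.90
    ):
        return 0.0

    if src in TIER1_SOURCES:
        if pick.get("risk_reward", 0.0) < 1.5:
            return 0.0
        # The strategy allow-list applies within aggregated_picks only.
        if src == "aggregated_picks" and pick.get("strategy") not in TIER1_STRATEGIES:
            return 0.0
        return 0.0075  # upper end of the 0.5-0.75% Tier 1 range

    if src in TIER2_SOURCES:
        return 0.0025

    return 0.0
```

Checking the AVOID conditions first means a blocked pick can never receive an allocation, whatever its tier, which matches the "block even if other gates pass" wording.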
9. Action Items
| Priority | Task | File |
| P0 | Add trust_score enrichment to active_picks.json (HC filter always returns 0 without it) | alpha_engine/data/active_picks.json + enrichment pipeline |
| P0 | Wire kimi_signal_tracking + aggregated_picks to emit picks to active_picks.json | alpha_engine/outcome_resolver.py |
| P1 | Add ELITE_STRATEGIES dict to weekly_filter_picks.py (AuditEnsemble_LONG WR=94.2%, MTF WR=90.8%, VWAP WR=97.1%) | tools/weekly_filter_picks.py |
| P1 | Fix HC Gate 8: block confidence 0.80-0.90 (not >0.90); corrects misspecification | audit_dashboard/hc_filter.js |
| P1 | Block CRYPTO SHORT direction in active filter (WR=30.3%, losing after costs) | audit_dashboard/hc_filter.js or tools/weekly_filter_picks.py |
| P2 | Wire Kimi's backtest.py to our closed picks CSV (real PBO, DSR, WFE validation) | edge_engine/backtest.py (Kimi's file, keep) |
| P2 | Add strategy-level filter to HC filter (AuditEnsemble_LONG, MTF, VWAP) | audit_dashboard/hc_filter.js |
| P2 | Fix COMMODITY pipeline (COT timing leakage; n=0 in OOS prevents validation) | reports/cot_timing_leakage_audit_2026-05-13.md |
| OPERATOR | Rotate ejaguiar1_backtests DB password (stocks123 exposed in git history PR #1086) | GitHub Secrets: DB_PASS_BACKTESTS |
Summary: The real edge lives in 2 systems (kimi_signal_tracking + aggregated_picks), 3 strategies within aggregated_picks (AuditEnsemble_LONG, Multi-Timeframe Trend Alignment, VWAP Deviation Scalp), LONG direction only, confidence 0.65-0.80. Source system selection is the dominant filter: it adds +26pp WR vs. no filter. MiniMax was right after self-correction. Kimi's performance claims were synthetic. Our OOS data is the only authoritative source.
Sources:
audit_trail/data/universal_resolved_picks.json — 5,000 picks, pre-registered OOS split (cutoff 2026-04-01)
reports/hc_filter_backtest_2026-05-16.md — HC gate OOS backtest analysis
reports/peer_notes/minimax_corrected_VERIFIED_2026-05-16.md — MiniMax self-correction verification
reports/peer_notes/minimax_ultimate_edge_2026-05-16_VETTED.md — MiniMax Round 1 vetting
docs/EXACT_FILTERS_FOR_UI_minimax_2026-05-16.md — KIMI code references verified
C:\Users\zerou\Downloads\HONEST_INTEGRATION_GUIDE.md — Kimi's honest self-assessment