← Back to Updates

πŸ¦… EAGLE2 Session Summary

Model: DeepSeek V4 Auditor: Mercury 2 Date: 2026-06-02 Mode: Quant Review + Implementation

🎯 What we set out to fix

The /audit production book showed 0 profitable asset classes. Research-grade edge existed only in isolated labs and tournament paper books β€” never reaching the live, policy-clean layer. This was not a "wait longer" problem. It was a research-to-production translation failure plus resolver/label contamination.

Imagine you have a lemonade stand but you keep pouring from the wrong pitcher. The good lemonade (ETF dual momentum, tournament models) is sitting in the fridge, while the stand keeps serving watery mix from a broken faucet. We fixed the pipes.

πŸ” Root Cause Analysis

Problem 1: Research promoted from wrong evidence

production_scanner.py ingests from 11 signal sources β€” most without walk-forward proof. The lab's own verification engine showed all strategies with Monte Carlo p β‰ˆ 0.45–0.52 β€” none statistically significant at 95% CI. Yet hundreds of strategies still emitted into production.

You're picking players for your team from the tryout list without checking which ones actually won games. A few are stars, but most are dragging the team down because nobody looked at their real stats.

Problem 2: FOREX TIME_EXIT was 3 different values

Three separate resolvers used 48h, 120h, and 7 days for the same FOREX picks. This meant a pick could be marked as a WIN by one resolver and a LOSS by another β€” depending on which ran first.

Three stopwatches timing the same race, each running at a different speed. Nobody knows who actually won because the referee, line judge, and scorekeeper all disagree on when the race ended.

Problem 3: 0.1bp threshold made noise look like wins

A legacy 0.1 basis point WIN threshold classified spread noise as profits β€” driving 63% of FOREX wins and 67% of COMMODITY wins to be resolver flicker, not real edge.

Counting pocket lint as money. The scale was so sensitive that normal market friction registered as "profit." We raised the bar to 5bp so only real gains count.

Problem 4: Emitter over-breadth

paper_trading/strategies/ has 56 strategy files (~150 individual strategies). Only 6 had verified forward proof. The rest generated noise that diluted any real edge.

150 cooks in a kitchen but only 6 have food safety certificates. The food tastes bad not because nobody can cook β€” but because too many untrained hands are messing with the pots.

Problem 5: Concentration artifacts masquerading as edge

Single-symbol concentration (e.g., BNBUSDT) inflated strategy performance. A surface dominated by one source/system looked amazing but was fragile or fake.

Betting everything on one horse and bragging when it wins. Looks great on paper until that horse has a bad day and your whole strategy collapses.

βœ… What was accomplished (5 Phases)

Overall Progress Phases 1-5 complete Β· ~3,000 lines Β· 17 files Β· 3 commits to main
1

Data & Resolver Hygiene

FixFilesStatus
Unified FOREX TIME_EXIT
48h/120h/7d β†’ 72h across all 7 resolvers
force_close_breached.py, universal_pick_resolver.py, outcome_resolver.py,
check_resolver_health.py, resolve_stale_open_picks.py,
orphan_resolver_dryrun.py, prune_active_picks.py
βœ… Done
Source provenance tagging
_resolver_version + _resolver_source on every resolved pick
universal_pick_resolver.py βœ… Done
Enabled crypto VWAP/Bollinger
Changed env defaults from "0" to "1"
crypto_verified_wf.py βœ… Done
Theme B contamination documented
Root cause already patched (v2, 2026-04-28); historical re-resolution remains
outcome_resolver.py (analysis) βœ… Documented
We fixed the stopwatches so they all agree on when a race ends. We added name-tags to every score so we know who recorded it. And we turned on two crypto strategies that were sitting idle because someone forgot to flip the switch.
2

Standardized Validation Pipeline (3 new modules)

ModuleWhat it doesSize
admissibility_pipeline.py Unified 10-step standard replacing 6 fragmented validators. Every strategy must pass pre-registration, purged-embargoed walk-forward, DSR/PBO correction, block bootstrap, regime robustness, forward evidence, and stability checks before capital allocation. ~480 lines
cost_model.py Per-asset-class cost curves: CRYPTO 13bps, EQUITY 7bps, ETF 3bps, FOREX 2bps, COMMODITY 7bps, FUTURES 5.5bps, BOND 4.5bps. ~100 lines
Block bootstrap Integrated into admissibility_pipeline.py (Step 6). Preserves temporal dependence. β€”
Like a driving test: instead of six different examiners with six different checklists, there's now ONE test everyone must pass. Parallel park (walk-forward), highway merge (block bootstrap), emergency stop (DSR/PBO). Fail any one and you don't get a license to trade real money.
3

Concentration Monitoring

ToolWhat it measuresAlert threshold
concentration_monitor.py Herfindahl-Hirschman Index for symbol and source concentration across all active pick sources HHI > 0.25 (alert), HHI > 0.20 (warning)
A smoke detector for your portfolio. If too much money is riding on one stock or one strategy, it beeps.
4

Emitter Discipline

ModuleHow it gatesImpact
emitter_discipline.py Blocks picks from KILL/MONITOR_ONLY strategies BEFORE quality gates. 25 strategies hard-killed, 8 monitor-only, 42 proven
The bouncer at the club door. Before anyone gets to the dance floor (quality gates), they check if you're on the banned list.
5

Edge Development Wiring

ETF dual momentum already wired and enabled. CRYPTO gatekeeper confidence-inversion gate at 70 already active. Crypto VWAP/Bollinger strategies now default-enabled.

The good lemonade is finally connected to the taps. ETF dual momentum was already plumbed in but needed the "open" sign turned on.

πŸ“Š Best Picks Per Asset Class

Asset ClassBest Pick / StrategyEvidencePFWRSampleMoney-Ready?
CRYPTO deepseek_v4 SHORT (BTC/ETH) AI tournament #1; EAGLE3 SHORT 67% WR vs LONG 33% (n=216) 3.46 57.7% 273 PAPER ONLY
Why: deepseek_v4 is #1 ranked across 46 models with highest PF. The SHORT directional edge is backed by 216 tournament picks. EAGLE-4 flip is active in scanner. Production CRYPTO PF remains 0.97 β€” this is PAPER edge, not real money.
EQUITY BAC, JPM, MSFT, NVDA EAGLE3 tournament rankings; individual ~64% WR in paper N/A ~64% paper NO
Why NOT: Production EQUITY has PF 0.33, WR 26.9% on n=52. These symbols show edge in paper book but collapse in live production. WATCH, don't trade.
ETF ETF Dual Momentum (EEM, IWM, GLD) Only Tier-2 PASS: PF 1.60, WR 53.8%, n=104; WF OOS PF 1.21 1.60 53.8% 104 SHADOW PILOT
Why best candidate: ONLY strategy passing Tier-2 admissibility. Simple 12-1 month momentum with SPY-trend guard. Lowest concentration risk. Blocker: forward paper n<30. Shadow-size at 0.2% next step.
FOREX HARD-DISABLED β€” WR 33.3%, PF 0.48, n=45. Indistinguishable from random. NO
Why: 63% of wins were spread noise. After fixing TIME_EXIT + threshold, needs complete rebuild. Lift criteria: WR β‰₯ 55% on n β‰₯ 150, PF β‰₯ 1.5.
COMMODITY COT rehabilitation needed PF 0.69, WR 40.4%, n=712 but contaminated 0.69 40.4% 712 NO
FUTURES INSUFFICIENT β€” n=13, WR 15.4%. Concentration artifacts. NO
BOND NO DATA β€” n=0 live sample. NO

🧠 Statistical Arsenal β€” Already in Codebase, Now Enforced

Statistical TestImplemented?Enforced at emission?
Bonferroni Correctionβœ… 10+ implementationsβœ… Via admissibility pipeline
Benjamini-Hochberg FDRβœ…βœ… Via admissibility pipeline
DSR / PBO / SPAβœ… Now in pipelineβœ… Step 5 of 10
Monte Carlo (permutation)βœ… 20+ implementationsβœ… Via admissibility pipeline
Block Bootstrap CIβœ…βœ… Step 6 of 10
Walk-Forward Validationβœ…βœ… Step 3 of 10
We have all the ingredients for a five-star meal β€” Bonferroni spices, Monte Carlo pots, DSR oven β€” but the chef was using only the microwave. The new admissibility pipeline forces the chef to use ALL the equipment before serving.

πŸ—ΊοΈ 12-Week Forward Plan

Short-Term (Weeks 1-4)

Week 1-2
Resolver audit, concentration monitor daily runs, historical COMMODITY label purge
Week 3-4
Wire admissibility pipeline into CI/GHA as default gate. Pre-register all existing strategies.

Medium-Term (Weeks 5-9)

Week 5-6
Purged-embargoed WF on ETF dual momentum + crypto VWAP/Bollinger
Week 7
Shadow-size ETF at 0.2%, crypto at 0.2%. Monitor live PF/WR.
Week 8-9
Mutation testing on failed-but-plausible sleeves. Run mega_mutation_tournament.py.

Long-Term (Weeks 10-12+)

Week 10
Promote any sleeve meeting live PF β‰₯ 0.5, WR β‰₯ 55%
Week 11-12
Full-size rollout. Update Quant Ops Dashboard. Complete emitter audit.
Quarterly
Goal: β‰₯2 capital-ready sleeves, resolver dispute rate < 1%, aggregate HHI < 0.20
Short-term: Clean the kitchen and install the new oven. Medium-term: Bake the recipes we know work and start taste-testing. Long-term: Open for paying customers with proven recipes only. No serving before the food inspector signs off.

πŸ€– LiteLLM Integration

The LiteLLM proxy at http://localhost:4000/v1 is operational with 16 models. Three new modes tested and verified:

ModeStatusBest use
ollama-cloud-largeβœ… WorkingDeep strategy research, backtest methodology design
ollama-cloudβœ… WorkingBrainstorming, quick analysis, swarm synthesis
ollama-cloud-localβœ… WorkingSafety checks, conservative analysis

πŸ† Bottom Line

We fixed the plumbing. The FOREX resolver was broken at the pipe joints. The emitter system was a firehose with no pressure valve. Six different validation blueprints for the same house. The statistical arsenal was fully stocked but nobody was required to use it.

Where the real edge lives: AI tournament (deepseek_v4 PF 3.46), ETF dual momentum lab (PF 1.60), crypto VWAP/Bollinger walk-forward. None are money-ready yet β€” but they now have a clear, enforced path to get there.

What's still needed: Forward paper evidence. The admissibility pipeline gates the door. Shadow-sizing puts strategies through the flight simulator. Only then do they graduate to real capital.

We built a bridge from the research island to the production mainland. Good strategies were stuck on the island. The bridge is open. Now we walk each strategy across, one at a time, making sure it doesn't collapse under real-world weight.

Live Audit Dashboard Β· AI Tournament Β· AI Leaderboard Β· Pick Funnel Β· Research Index