10 investigations for durable edge across asset classes

RESEARCH SURFACE — NOT FINANCIAL ADVICE
Per-investigation assessment + implementation plans. No /audit production gate changes from this doc. Numbers cited come from audit_dashboard/data/dashboard_data.json at the timestamp below.
Published 2026-05-11T19:35Z · Branch feat/audit-dashboard-enhancements-hermes-2026-05-09 · Data snapshot generated_at 2026-05-10T04Z

TL;DR

Ten angles surfaced for durable cross-class edge. Five are already built or partially built (#1, #3, #5, #6, #8); five are net-new (#2, #4, #7, #9, #10). Highest-value pair to ship first: #1 rolling-window profiling × #2 edge-decay heatmap (shared compute kernel, single dashboard widget, no production-strategy risk).

System already has: rolling-30d metrics, walk-forward by_class, hf_decay_watchlist, regime annotations (VIX/BTC.D/DXY), concept_drift KS statistics, TA baseline benchmark grid, walk-forward Tier-1 promotion gate, benchmark-relative 30d excess return per system.

What's missing: rolling-window panels at multiple horizons, edge-decay heatmap, top-N portfolio slice simulator, meta-learning gate-pass predictor, formal peer-review rubric.

#1 Temporal-stability profiling — rolling Sharpe/Sortino/Consistency/Excess at 7d/30d/90d/1y/3y

PARTIAL · ~1d work

Current state

hf_stats.rolling_metrics already emits a 14-row history, but only at one window (window_days=30). Most recent row, 2026-04-22: net_sharpe 0.1605 / WR 42.98% / n=3134 / max_drawdown_pct 721.91 / ulcer_index 362.7. Earlier row, 2026-04-09: net_sharpe 0.3695 / WR 48.05% / n=768 / max_drawdown_pct 105.17. Over 13 days Sharpe more than halved and max_drawdown_pct grew about 6.9× (the ulcer index about 6×). The deterioration is visible only because the 30d rolling series exists; whether it also shows at 7d, 90d, or longer horizons is unknown because those series are not computed.

Plan

  1. Add ROLLING_WINDOWS_DAYS = [7, 30, 90, 365, 1095] in audit_trail/dashboard_generator.py.
  2. Parameterize the existing _compute_rolling_metrics(closed, window_days) helper; emit hf_stats.rolling_by_window[window_days] as a list of timestamp-keyed rows.
  3. Per asset class: pre-group closed by asset_class, run the same kernel n_classes × n_windows times. Skip windows where n_trades < 30 (statistical floor).
  4. Frontend: add a Chart.js line chart of net_sharpe per window, one line per asset class. A Sharpe that is stable or steadily improving at long horizons indicates a durable edge; noisy short windows alongside stable long windows indicate healthy convergence.
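The plan above can be sketched as follows. This is a minimal illustration, not the actual _compute_rolling_metrics helper: the function names, the row layout (asset_class, timestamp, pnl_pct), and the per-trade (non-annualized) Sharpe definition are assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev

ROLLING_WINDOWS_DAYS = [7, 30, 90, 365, 1095]
MIN_TRADES = 30  # statistical floor from step 3


def compute_window_metrics(closed, as_of, window_days):
    """Metrics for picks closed in (as_of - window_days, as_of]; None below the floor."""
    start = as_of - timedelta(days=window_days)
    pnl = [p["pnl_pct"] for p in closed if start < p["timestamp"] <= as_of]
    if len(pnl) < MIN_TRADES:
        return None
    sd = stdev(pnl)
    return {
        "n": len(pnl),
        "win_rate": sum(1 for x in pnl if x > 0) / len(pnl),
        # Assumption: per-trade Sharpe (mean / stdev), not annualized.
        "net_sharpe": mean(pnl) / sd if sd > 0 else 0.0,
    }


def rolling_by_window(closed, as_of):
    """Run the same kernel n_classes x n_windows times, skipping thin windows."""
    by_class = {}
    for pick in closed:
        by_class.setdefault(pick["asset_class"], []).append(pick)
    out = {}
    for cls, picks in by_class.items():
        out[cls] = {}
        for window_days in ROLLING_WINDOWS_DAYS:
            metrics = compute_window_metrics(picks, as_of, window_days)
            if metrics is not None:
                out[cls][window_days] = metrics
    return out
```

Calling rolling_by_window once per history timestamp would produce the timestamp-keyed rows described in step 2; windows with fewer than 30 trades are simply absent from the result rather than emitted as noisy points.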

Risk

LOW. Read-only on closed picks; no production-gate change. CPU cost is estimated at <3s per dashboard build (extrapolated from the current 14-row build, not measured).

#2 Edge-decay heatmap — class × horizon → excess-return matrix

NOT BUILT · ~0.5d on top of #1

Rationale

Pairs naturally with #1: same rolling-window kernel, different cell aggregation. Highlights classes where edge erodes fast (CRYPTO post-quarantine) vs. classes that stay robust (ETF walk-forward consistency 100% on 4 folds).

Plan

  1. Compute matrix excess_return[class][window_days] where excess = sum(pnl_pct in window) − benchmark_return(class, window_days). Benchmark already wired in tools/live_market_fetcher.benchmark_return(); commit cf229ea31ba added per-system 30d. Extend for 7/30/90/365/1095.
  2. Emit under edge_decay_heatmap top-level key in dashboard_data.json.
  3. Frontend: CSS grid color cells (red < −5pp, amber 0±5pp, green > +5pp) with cell-tooltip showing n_trades + benchmark used.
  4. Sort columns left-to-right shortest-to-longest horizon so monotonic-decay strategies show a clear color gradient.
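A minimal sketch of the matrix and color bucketing described above. The inputs are plain dictionaries standing in for the rolling kernel output and for tools/live_market_fetcher.benchmark_return(), whose real signature is not shown in this document; the function names and output layout are assumptions.

```python
def color_bucket(excess_pp):
    """Map excess return (percentage points) to the three colors from step 3."""
    if excess_pp < -5:
        return "red"
    if excess_pp > 5:
        return "green"
    return "amber"


def edge_decay_heatmap(pnl_sum, benchmark, n_trades):
    """Build the class x horizon excess-return matrix.

    pnl_sum[cls][w]   : sum of pnl_pct inside window w (percent)
    benchmark[cls][w] : benchmark return over the same window (percent)
    n_trades[cls][w]  : trade count, carried through for the cell tooltip
    """
    # Step 4: columns ordered shortest-to-longest horizon.
    horizons = sorted({w for row in pnl_sum.values() for w in row})
    cells = {}
    for cls, row in pnl_sum.items():
        cells[cls] = {}
        for w in horizons:
            if w not in row or w not in benchmark.get(cls, {}):
                continue  # leave the cell empty rather than guess a benchmark
            excess = row[w] - benchmark[cls][w]
            cells[cls][w] = {
                "excess_pp": excess,
                "color": color_bucket(excess),
                "n_trades": n_trades.get(cls, {}).get(w),
            }
    return {"horizons": horizons, "cells": cells}
```

Keeping the sorted horizons list in the payload lets the frontend render columns in a fixed order without re-deriving it from the cell keys.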

#3 Cross-symbol variance — per-class std-dev of edge metrics across symbols

PARTIAL · ~0.5d

Current state

tools/run_tv_backtest_benchmark.py already emits per-symbol PF/Sharpe/WR/MDD/trades across 7 symbols. In a sample 7-symbol full run (reports/tv_backtest_benchmark_20260511T173937Z.json), only QQQ:rsi passes robustness≥0.60 AND trades≥5 across the entire grid. That is the cross-symbol variance signal: the apparent winners on the other 6 of 7 symbols do not clear the robustness bar and are likely statistical noise.

Plan

  1. Same grid expanded to 10 symbols per class via per-class watchlist in tools/run_tv_backtest_benchmark.py: CRYPTO=BTC/ETH/SOL/AVAX/MATIC/LINK/DOGE/XRP/ADA/BNB; EQUITY=SPY/QQQ/IWM/AAPL/MSFT/NVDA/META/AMZN/GOOGL/TSLA; etc.
  2. Emit std_dev_by_class[class][metric] = standard deviation across symbols. High std on PF + low std on WR = symbol-luck issue; low std on both = class-wide signal.
  3. Wire by_class.cross_symbol_std into TA-baseline panel renderer (Opt A, commit 4ea32d227cf).
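A minimal sketch of the aggregation in step 2, assuming the benchmark output has already been loaded into a nested dictionary. The per_symbol layout and the metric key names are hypothetical; the real report schema may differ.

```python
from statistics import stdev


def std_dev_by_class(per_symbol):
    """per_symbol[cls][symbol][metric] -> result[cls][metric] (sample std-dev).

    Classes with fewer than two symbols are skipped, since a sample
    standard deviation is undefined for a single observation.
    """
    out = {}
    for cls, symbols in per_symbol.items():
        if len(symbols) < 2:
            continue
        # Only metrics reported for every symbol in the class are compared.
        metrics = set.intersection(*(set(m) for m in symbols.values()))
        out[cls] = {
            metric: stdev(row[metric] for row in symbols.values())
            for metric in sorted(metrics)
        }
    return out
```

Reading the result follows the rule in step 2: a large spread on PF with a small spread on WR points to symbol luck, while small spreads on both point to a class-wide signal.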

#4 Strategy-type clustering — group by feature family, compare cluster Sharpe

NOT BUILT · ~2d

Current state

System has 30+ named strategies but no canonical feature taxonomy. The peer-research orchestrator (PR 3) keyword-routes spec.entry text to 6 signal handlers (sma_cross / rsi_mr / momentum / mean_reversion_zscore / breakout / buy_and_hold) — that's the closest existing taxonomy. alpha_engine/ml_ranker.py + feature_health.py exist but operate per-pick, not per-strategy-family.

Plan

  1. Tag each strategy with binary feature vector: [uses_ma, uses_rsi, uses_volume, uses_sentiment, uses_orderbook, uses_funding, uses_breakout, uses_mean_reversion]. Hand-curated for ~30 strategies, ~1h work.
  2. K-means k=4 on the binary matrix. Likely surfaces: (a) MA-cross trend, (b) mean-reversion oscillator, (c) breakout/volume-impulse, (d) ML-ensemble blend.
  3. Per cluster compute rolling Sharpe at 30d/90d/1y across the same windows used in #1. Resilient cluster = stable Sharpe across regime shift (2026-04-22 VIX collapse pivot).
  4. Emit strategy_clusters top-level key. Render as small grouped bar chart per cluster.
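Steps 1 and 2 can be sketched with a small dependency-free k-means. The strategy names and tags passed in are illustrative placeholders, not the real hand-curated tags, and the deterministic farthest-point initialization is a choice made for this sketch so results are reproducible; a library implementation would work equally well.

```python
FEATURES = ["uses_ma", "uses_rsi", "uses_volume", "uses_sentiment",
            "uses_orderbook", "uses_funding", "uses_breakout",
            "uses_mean_reversion"]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Assumption: farthest-point initialization instead of random seeding.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_strategies(tags, k=4):
    """tags: {strategy_name: binary vector in FEATURES order} -> {name: cluster id}."""
    names = sorted(tags)
    labels = kmeans([tags[n] for n in names], k)
    return dict(zip(names, labels))
```

With only about 30 strategies and 8 binary features, cluster membership will be sensitive to individual tags, so the cluster labels are best treated as a starting taxonomy to review by hand.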

Caveat

Hand-curated feature tags are subjective. Consider auto-tagging via LLM (cheap-engine call per strategy with spec text) once #6 v3 spec-translator from the research orchestrator is shipped.

#5 Regime-sensitive edge weighting — tag picks with prevailing regime, recompute per regime

PARTIAL · ~1d

Current state

Heavy infrastructure already exists. tools/live_market_fetcher.py classifies VIX (COMPLACENCY/NORMAL/ELEVATED/PANIC), BTC.D (RISING_STRONG/RISING_MILD/FALLING), DXY (USD_WEAK/STRONG/FLAT), equity regime (RIPPING/GAINING/FLAT/FALLING). audit_trail/quality_gates.py:4111 annotates picks at gate time. regime_validation block exists in dashboard_data.json but currently TRENDING_UP/DOWN/RANGING/HIGH_VOL/CRASH all show total=0 — the regime tag isn't being persisted into closed-pick rows.

Plan

  1. Bug-fix: audit_trail/quality_gates.py computes regime tags but doesn't persist them to pick.regime_tag. Add persistence step in passes_active_gate.
  2. Backfill closed picks by joining closed.timestamp against the historical regime cache. Need a regime-history file — today live_market_regime.json is point-in-time. Add daily snapshot to tools/live_market_fetcher.py writing to audit_dashboard/data/regime_history/.json.
  3. Compute per_regime_metrics[regime][class] = {sharpe, wr, n_trades}. Strategies that only deliver in one regime get fragile=true flag.
  4. Surface in dashboard: per-system "edge regime profile" badge (regime in which the system was best, n, Sharpe).
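A minimal sketch of steps 2 and 3, assuming closed picks carry an ISO-8601 timestamp string and that the regime history has been loaded into a {YYYY-MM-DD: regime} dictionary. That history layout, the helper names, and the "exactly one regime with positive Sharpe" reading of fragile are all assumptions for illustration.

```python
from statistics import mean, stdev


def backfill_regime(closed, regime_history):
    """Tag each closed pick with the regime recorded for its calendar day."""
    for pick in closed:
        day = pick["timestamp"][:10]  # assumes an ISO-8601 string
        pick["regime_tag"] = regime_history.get(day)  # None if no snapshot
    return closed


def per_regime_metrics(closed):
    """Return metrics[regime][asset_class] = {sharpe, wr, n_trades}."""
    groups = {}
    for pick in closed:
        if pick.get("regime_tag") is None:
            continue  # untagged picks are excluded, not assigned a default
        by_class = groups.setdefault(pick["regime_tag"], {})
        by_class.setdefault(pick["asset_class"], []).append(pick["pnl_pct"])
    out = {}
    for regime, by_class in groups.items():
        out[regime] = {}
        for cls, pnl in by_class.items():
            sd = stdev(pnl) if len(pnl) > 1 else 0.0
            out[regime][cls] = {
                # Per-trade Sharpe; None when it cannot be computed.
                "sharpe": mean(pnl) / sd if sd > 0 else None,
                "wr": sum(1 for x in pnl if x > 0) / len(pnl),
                "n_trades": len(pnl),
            }
    return out


def is_fragile(sharpe_by_regime):
    """True when exactly one of several observed regimes has positive Sharpe."""
    positive = [r for r, s in sharpe_by_regime.items() if s is not None and s > 0]
    return len(sharpe_by_regime) > 1 and len(positive) == 1
```

Picks that fall on days with no regime snapshot stay untagged and drop out of the metrics, which makes gaps in the history visible as a lower n_trades instead of silently mislabeling them.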

Risk note

Concept-drift root-cause report (reports/concept_drift_root_cause_2026-05-11.md) confirmed VIX -44.64% / 30d collapse is the real driver. Most current "edge" was earned in PANIC-vol regime that no longer exists. Without per-regime tagging, every Tier-2 claim is implicitly regime-conditioned.

#6 Locked-window OOS forward-test — freeze model at T-30, run T+30 forward-only

BUILT · ~0d (wire-in only)

Current state

Walk-forward by_class in alpha_engine/walkforward_validator.py already does locked-window train/test splits. dashboard_data.json::walkforward.by_class: ETF folds=4 consistency=100% oos_sharpe=11.41; EQUITY folds=8 consistency=75% oos_sharpe=6.43; CRYPTO folds=25 consistency=84% oos_sharpe=2.57; FOREX folds=52 consistency=48.1% oos_sharpe=-3.74. Opt B (commit cf4e924744a) wired this into Tier-1 promotion gate.

What's missing

  1. Per-strategy walk-forward. Currently only by_class. walk_forward_by_strategy() would isolate which strategies pass/fail OOS within a class. Useful as a kill-list seed.
  2. Forward-only "post-quarantine" cohort. Lock dataset cutoff at quarantine commit timestamp; run 30d forward; compare to walk-forward OOS to validate the quarantine improved system Sharpe. Use commit d884694ace2 (2026-05-10 P0 quarantine) as cutoff.
  3. Surface as walkforward.by_strategy in dashboard payload + table in /audit.

#7 Edge-per-capital-allocation slice — top-N portfolio Monte Carlo

NOT BUILT · ~1.5d

Rationale

Settles concentration-vs-diversification debate empirically. Current 6 Tier-2 verified systems span PF 1.84 to PF 19.19 — equal-weight is probably wrong, max-weight on PF 19.19 (multi_asset_cot n=130) is probably also wrong.

Plan

  1. For each class, rank systems by recent excess_return_30d_pct (already wired by W4, commit cf229ea31ba).
  2. For N in {1, 3, 5, 10, 20}: simulate equal-weight top-N portfolio. Walk-forward by 30-day rebalance.
  3. Emit portfolio_topN[class][N] = {sharpe, mdd, n_trades, holding_period_mean, turnover_pct}.
  4. Render Pareto frontier: x=concentration (N), y=Sharpe, color=MDD. Inflection point = optimal N.
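A minimal sketch of steps 1 to 3 as a deterministic walk-forward: rank systems on one 30-day period, hold the equal-weight top N over the next. The input layout (one {system: return_pct} dictionary per period) and the turnover definition are assumptions; a real run would rank on excess_return_30d_pct and could add resampling on top for the Monte Carlo part.

```python
from statistics import mean, stdev


def simulate_top_n(period_returns, n):
    """period_returns: list of {system: return_pct}, one per 30-day period, oldest first.

    Systems are ranked on period t and held equal-weight during period t+1,
    so the ranking never sees the returns it is evaluated on.
    """
    realized, turnovers, prev = [], [], None
    for rank_period, hold_period in zip(period_returns, period_returns[1:]):
        ranked = sorted(rank_period, key=rank_period.get, reverse=True)
        held = [s for s in ranked if s in hold_period][:n]
        if not held:
            continue
        realized.append(mean(hold_period[s] for s in held))
        if prev is not None:
            # Share of the new portfolio that was not held last period.
            turnovers.append(len(set(held) - prev) / len(held))
        prev = set(held)

    # Max drawdown on the compounded equity curve.
    equity, peak, mdd = 1.0, 1.0, 0.0
    for r in realized:
        equity *= 1 + r / 100
        peak = max(peak, equity)
        mdd = max(mdd, (peak - equity) / peak)

    sd = stdev(realized) if len(realized) > 1 else 0.0
    return {
        "sharpe": mean(realized) / sd if sd > 0 else None,  # per-period, not annualized
        "mdd_pct": mdd * 100,
        "n_periods": len(realized),
        "turnover_pct": 100 * mean(turnovers) if turnovers else 0.0,
    }
```

Running this for N in {1, 3, 5, 10, 20} per class gives the points for the frontier in step 4; with few rebalance periods the Sharpe estimate is very noisy, so n_periods should be shown next to it.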

#8 Feature-importance drift detection — per-feature KS statistics, flag > 0.2

PARTIAL · ~1.5d

Current state

hf_stats.concept_drift emits ONE system-wide KS_D=0.313 (vs critical 0.047). That's output-distribution drift on pnl_pct. Input-feature drift is not computed. alpha_engine/feature_health.py + ml_drift_repair_workflow.py exist but are not wired to dashboard_data.json.

Plan

  1. Pick 8 canonical features per asset class. For CRYPTO: 1d_realized_vol, 7d_realized_vol, volume_z, funding_rate, oi_chg_24h, btc_d, vix, dxy_chg_30d.
  2. Snapshot daily feature distributions to audit_dashboard/data/feature_dist_history/.parquet (or .json if no parquet available).
  3. Compute KS-D between current distribution and 90d-back distribution per feature. Flag when KS > 0.20.
  4. Emit feature_drift[class][feature] = {ks_D, ks_critical, alert_on}.
  5. Dashboard surface: small drift-grid alongside concept_drift block.
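Steps 3 and 4 can be sketched with a dependency-free two-sample Kolmogorov-Smirnov statistic. The function names and input layout are assumptions; the critical value uses the standard large-sample approximation c(α)·sqrt((n+m)/(n·m)) with c ≈ 1.358 for α = 0.05, which may not be the formula behind the existing concept_drift block.

```python
from bisect import bisect_right
from math import sqrt


def ks_two_sample(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the supremum is attained at a sample point
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def ks_critical(n, m, c_alpha=1.358):
    """Large-sample critical value; c_alpha=1.358 corresponds to alpha=0.05."""
    return c_alpha * sqrt((n + m) / (n * m))


def feature_drift(current, baseline, threshold=0.20):
    """current/baseline: {feature: list of values}; baseline is the 90d-back snapshot."""
    out = {}
    for feature, values in current.items():
        if feature not in baseline:
            continue  # no history yet for this feature
        d = ks_two_sample(values, baseline[feature])
        out[feature] = {
            "ks_D": d,
            "ks_critical": ks_critical(len(values), len(baseline[feature])),
            "alert_on": d > threshold,  # fixed threshold from step 3
        }
    return out
```

Emitting ks_critical next to the fixed 0.20 threshold lets the dashboard distinguish a statistically detectable shift from one large enough to act on, since with large samples the critical value can sit far below 0.20.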

Caveat

Snapshot history needs at least 90d of accumulated data before this is actionable. Start writing snapshots today so the readout is meaningful in Aug.

#9 Meta-learning edge estimator — predict walk-forward gate-pass probability

NOT BUILT · ~3d

Plan

  1. Build labeled dataset: every (strategy, class, timestamp) row labeled gate_pass = 1 if walk-forward consistency ≥ 60% AND oos_sharpe > 0 (Opt B gate definition), else 0.
  2. Features per row: trailing-30d PF, trailing-30d Sharpe, n_trades_30d, recent_drawdown, regime tags, concept_drift_KS, cross-symbol_std (from #3).
  3. Logistic regression first (interpretable, low overfit risk). Cross-validate with purged-KFold (no temporal leakage). Already have mlfinlab purged-CV shim in repo (per project_next_phase_integrations_2026_04_22.md).
  4. Emit predicted p_gate_pass per system in dashboard_data.json.
  5. Display as 0-100 "trust score v2" alongside existing trust_score (ρ=+0.196 per project_performance_reality.md).
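A minimal sketch of the labeling rule in step 1 and a simplified purged split for step 3. The purged_kfold generator below is a hand-rolled illustration that drops a fixed number of rows on each side of the test block; it is not the mlfinlab shim mentioned above, and both function names are hypothetical.

```python
def gate_pass(consistency_pct, oos_sharpe):
    """Label from step 1: consistency >= 60% AND oos_sharpe > 0."""
    return int(consistency_pct >= 60 and oos_sharpe > 0)


def purged_kfold(n_samples, n_splits, purge):
    """Yield (train_idx, test_idx) over time-ordered rows.

    Each test fold is a contiguous block. Training rows within `purge`
    positions of the block (before or after) are dropped so that
    overlapping label windows cannot leak into training.
    """
    fold = n_samples // n_splits
    for k in range(n_splits):
        lo = k * fold
        hi = n_samples if k == n_splits - 1 else lo + fold
        test = list(range(lo, hi))
        train = [i for i in range(n_samples)
                 if i < lo - purge or i >= hi + purge]
        yield train, test
```

Applied to the by_class walk-forward figures quoted under #6, the rule labels ETF, EQUITY, and CRYPTO as 1 and FOREX (consistency 48.1%, oos_sharpe -3.74) as 0. The purge width should be at least as long as the label horizon, for example the number of rows spanned by a trailing-30d feature window.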

Risk

MEDIUM. ML estimator over a small training set (n < 200 strategies) can overfit. Use walk-forward gate as the ground truth label only after Opt B (commit cf4e924744a) accumulates ~3mo of demotions for label balance.

#10 Peer-review "edge-confidence" rubric — 2-analyst scoring per candidate

NOT BUILT · ~0.5d (rubric); ongoing labor (reviews)

Plan

  1. Draft docs/EDGE_REVIEW_RUBRIC.md with 6-axis scoring 1-5: (1) data-quality, (2) regime-fit, (3) statistical significance (n, p-value), (4) backtest-leakage controls (purge gap, embargo), (5) cross-symbol generalization, (6) cost-honesty (slippage + funding + execution latency modeled).
  2. Score range 6-30. Auto-promote candidates only if BOTH reviewers score ≥ 22 and walk-forward gate passes.
  3. Record scores in audit_dashboard/data/peer_review/_.json. Surface latest score on /audit system card.
  4. Run via swarm: deepseek + xai + cerebras each score independently as cheap multi-AI proxy for human reviewers. Cost ~$0.15 per system.
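The scoring and promotion rule in steps 1 and 2 can be sketched as below. The axis key names are hypothetical shorthand for the six axes, and requiring at least two reviews, all at or above the threshold, is this sketch's reading of "BOTH reviewers".

```python
AXES = (
    "data_quality",
    "regime_fit",
    "statistical_significance",
    "leakage_controls",
    "cross_symbol_generalization",
    "cost_honesty",
)
PROMOTE_MIN = 22  # out of a 6-30 range


def total_score(scores):
    """Sum the six axis scores after validating each is an integer in 1..5."""
    missing = [a for a in AXES if a not in scores]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    for axis in AXES:
        if not isinstance(scores[axis], int) or not 1 <= scores[axis] <= 5:
            raise ValueError(f"{axis} must be an integer from 1 to 5")
    return sum(scores[a] for a in AXES)


def auto_promote(reviews, walkforward_gate_passed):
    """Promote only if the gate passed and every reviewer scored >= PROMOTE_MIN."""
    if not walkforward_gate_passed or len(reviews) < 2:
        return False
    return all(total_score(r) >= PROMOTE_MIN for r in reviews)
```

A threshold of 22 means a reviewer must average above 3.6 per axis, so a candidate scoring 4 on every axis (24) passes while one scoring 4 on three axes and 3 on the rest (21) does not.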

Priority matrix

#  | Investigation               | State             | Effort       | Risk     | Pairs with
1  | Rolling-window profiling    | PARTIAL           | 1d           | LOW      | #2, #5, #6
2  | Edge-decay heatmap          | NEW               | 0.5d         | LOW      | #1
6  | Locked-window OOS           | BUILT             | 0d           | LOW      | Opt B already shipped
5  | Regime-sensitive weighting  | PARTIAL (bug-fix) | 1d           | LOW      | #1, #8
3  | Cross-symbol variance       | PARTIAL           | 0.5d         | LOW      | Opt A panel
7  | Top-N portfolio Monte Carlo | NEW               | 1.5d         | MED      | W4 (excess_return_30d)
10 | Peer-review rubric          | NEW               | 0.5d + labor | LOW      | #9
4  | Strategy clustering         | NEW               | 2d           | MED      | Research orchestrator v3
8  | Feature-drift KS            | PARTIAL           | 1.5d         | MED      | #5 regime history
9  | Meta-learning gate-pass     | NEW               | 3d           | MED-HIGH | #10, #3

Quick-win combo — ship next

Wave 1 (1.5d): #1 + #2. Shared kernel, single dashboard widget, zero production-strategy risk. Surfaces edge durability per class at a glance.

Wave 2 (1d): #5 bug-fix + #3. Persists regime tag on picks (closes the regime_validation.regime_wr_breakdown all-zero rows). Extends Opt A panel with cross-symbol std-dev.

Wave 3 (1.5d): #7. Top-N portfolio simulator. Settles concentration debate empirically before any real-money sizing decision.

Total: ~4d effort for 5 of the 10 angles (#1, #2, #3, #5 bug-fix, #7), no production-gate risk, additive to dashboard only.

Generated 2026-05-11. Research surface — not financial advice. See /audit/ for live dashboard.