Research · Beta · vN.3 · offline · human-gated

Agent Pipeline · Research Orchestration

How a small DAG of single-purpose agents turns the current macro regime into a reviewable strategy proposal — fully audited, offline, and stopped at a human gate.

This is the offline research pipeline (vN.3) behind the portfolio — NOT an auto-trader. It reuses the tested backtest engine (vN.1) and bounded search (vN.2), then layers a red-team critic and a human gate on top. Every provider call is logged to an audit trail; the pipeline writes ONLY under research/proposals/ and never touches the live book in data/.

The pipeline

Read it like an orchestrator-workers system: a control plane (the Orchestrator) drives a data plane of single-purpose agents left to right, every call is logged, and the only decision point is a human gate — the machine never writes to the live book itself.

%%{init: {"theme":"base","themeVariables":{"fontFamily":"ui-sans-serif, system-ui, -apple-system, Segoe UI, sans-serif","fontSize":"14px","background":"#faf9f5","lineColor":"#8a8576","primaryTextColor":"#141413"},"flowchart":{"htmlLabels":true,"nodeSpacing":50,"rankSpacing":58,"padding":16,"useMaxWidth":true,"curve":"basis"}}}%%
flowchart TB
  IN(["data/regime.yaml + price panel offline inputs"]):::store
  subgraph CTRL["control plane"]
    ORCH["Orchestrator · the conductor drives every call logs audit · hashes proposal_id"]:::det
  end
  subgraph DP["single-purpose agents · data plane (offline)"]
    direction LR
    A1["RegimeAgent pure read → views no LLM · no RNG"]:::det
    A2["Signals factor z-scores"]:::det
    A3["HypothesisAgent falsifiable thesis LLM-optional"]:::llm
    A4["Generator + Search bounded replay search"]:::det
    A5["CriticAgent · red team DSR gate + stress re-sim produces accept flag"]:::redteam
    A6["CuratorAgent base / bull / bear drafts"]:::det
    A1 --> A2 --> A3 --> A4 --> A5 --> A6
  end
  ART[("research/proposals/ID/ 5 artifacts + audit.jsonl never writes data/")]:::store
  GATE{"HUMAN GATE reviews drafts + verdict"}:::human
  LIVE(["human manually copies weights → data/ live book"]):::term
  ARCH(["archived · zero deploy"]):::term
  IN --> A1
  A6 --> ART --> GATE
  GATE -- approve --> LIVE
  GATE -- decline --> ARCH
  ORCH -. drives + logs .-> DP
  classDef det fill:#efede4,stroke:#a39e8f,color:#141413;
  classDef llm fill:#dbe8f4,stroke:#6a9bcc,color:#234a68,stroke-width:2px,stroke-dasharray:5 3;
  classDef redteam fill:#f6ddd0,stroke:#d97757,color:#8a3a1d,stroke-width:2px;
  classDef store fill:#ece8dc,stroke:#b0aea5,color:#57534b;
  classDef human fill:#e1e8d3,stroke:#788c5d,color:#3c4a28,stroke-width:2px;
  classDef term fill:#faf9f5,stroke:#cdc9bc,color:#57534b;
  style CTRL fill:#f4f2ea,stroke:#dcd8cb,color:#57534b;
  style DP fill:#f4f2ea,stroke:#dcd8cb,color:#57534b;

Figure A · Orchestration topology. Node colour marks each step's trust level (key below); the dashed arrow is control + logging, solid arrows are data flow.

deterministic: rule-based, no LLM, no RNG
LLM-optional: rule-based by default; Claude only with a key
red team: the critic that tries to reject
store / artifact: audited, written only under research/proposals/
human gate: the only decision point; machine never auto-accepts

Step by step

1. RegimeAgent
PURE read of data/regime.yaml → coarse-class views (no LLM, no RNG).

2. Signals
Real cross-asset factor z-scores (momentum, defensive) from the price panel.

3. HypothesisAgent
Regime + signals → an explicit, falsifiable thesis (orientation, statement, falsification).

4. Generator + Search
Build a bounded vN.2 search space from the thesis, then run the multi-window replay search (not yet a true walk-forward; see limitation below).

5. CriticAgent
DSR gate (deflated Sharpe) + REAL asset-shock stress re-sim against the finalist.

6. Curator + Orchestrator
Compile drafts, write proposal + audit; hand off to a HUMAN.

Single-purpose agents

RegimeAgent

owns: the numbers

Aggregates the fine tactical matrix (OW=+1/N=0/UW=−1) up to coarse-class scores. Deterministic — owns no prose.

HypothesisAgent

owns: the thesis

Turns the regime view + real signal z-scores into an explicit orientation per class, a statement, and 4–6 falsification conditions. Audited.
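One way to picture the thesis is as a small validated record. The sketch below is illustrative only: the class name `Hypothesis`, its field names and the validation rules are assumptions, not the repo's actual schema. Only the three parts (orientation, statement, falsification) and the 4 to 6 bound come from the description above.

```python
from dataclasses import dataclass

VALID_ORIENTATIONS = {"OW", "N", "UW"}  # overweight / neutral / underweight


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for a falsifiable thesis (not the repo's schema)."""

    orientation: dict        # coarse class -> "OW" | "N" | "UW"
    statement: str           # the thesis in one or two sentences
    falsification: tuple     # conditions that would refute the thesis

    def __post_init__(self) -> None:
        bad = {k: v for k, v in self.orientation.items()
               if v not in VALID_ORIENTATIONS}
        if bad:
            raise ValueError(f"unknown orientation labels: {bad}")
        if not self.statement.strip():
            raise ValueError("statement must not be empty")
        if not 4 <= len(self.falsification) <= 6:
            raise ValueError("expected 4 to 6 falsification conditions")


# Placeholder content: the statement and conditions here are dummies.
thesis = Hypothesis(
    orientation={"commodities": "OW", "credit": "N", "equities": "UW", "rates": "OW"},
    statement="placeholder statement",
    falsification=("condition 1", "condition 2", "condition 3", "condition 4"),
)
```

Making the record frozen means later agents can reuse the statement and falsification conditions without being able to alter them.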

Generator + Search

owns: the candidates

Derives a bounded search space from the thesis (sign fixed, magnitude searched), then ranks trials by a multi-window replay objective (not yet true OOS — see limitation).
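The idea of a multi-window replay with train-only signals can be sketched as follows. This is a hedged illustration, not the vN.2 search code: the helper names `replay_windows` and `zscore_from_train`, the `min_train` parameter and the choice of contiguous, roughly equal test windows are all assumptions.

```python
import statistics


def replay_windows(n_obs: int, n_splits: int, min_train: int):
    """Return (train, test) index ranges; each train range ends where its test window starts."""
    if n_splits < 1 or min_train < 1 or n_obs - min_train < n_splits:
        raise ValueError("not enough observations for the requested splits")
    test_len = (n_obs - min_train) // n_splits
    windows = []
    for k in range(n_splits):
        start = min_train + k * test_len
        # The last window absorbs any remainder so every observation is used.
        stop = n_obs if k == n_splits - 1 else start + test_len
        windows.append((range(0, start), range(start, stop)))
    return windows


def zscore_from_train(train_values, latest: float) -> float:
    """Standardise `latest` with mean and std estimated on training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return 0.0 if std == 0 else (latest - mean) / std


if __name__ == "__main__":
    # 85 observations and 3 splits match the worked example; min_train=25 is made up.
    for train, test in replay_windows(85, 3, 25):
        print(f"train [0, {train.stop})  test [{test.start}, {test.stop})")
```

Because every training range stops exactly where its test window begins, a signal standardised with `zscore_from_train` cannot see data from the window it is scored on.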

CriticAgent

owns: the red team

A strict DSR gate (null/negative evidence is rejected) plus a real stress re-sim: r_group = Σ wᵢ·shockᵢ on the finalist’s own weights.
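The stress formula r_group = Σ wᵢ·shockᵢ is a weighted sum, which a few lines make concrete. This is a minimal sketch: the function names, the `max_loss` flagging threshold, the scenario names and every number below are made up for illustration and are not taken from the repo.

```python
def stress_return(weights: dict, shocks: dict) -> float:
    """r_group = sum_i w_i * shock_i, evaluated on the finalist's own weights."""
    missing = set(weights) - set(shocks)
    if missing:
        raise KeyError(f"no shock defined for: {sorted(missing)}")
    return sum(w * shocks[asset] for asset, w in weights.items())


def stress_flags(weights: dict, scenarios: dict, max_loss: float) -> list:
    """Names of scenarios whose re-simulated return is worse than -max_loss."""
    return [name for name, shocks in scenarios.items()
            if stress_return(weights, shocks) < -max_loss]


# Illustrative inputs only (hypothetical weights, shocks and threshold).
weights = {"equities": 0.6, "rates": 0.4}
scenarios = {
    "scenario_a": {"equities": -0.20, "rates": -0.10},  # 0.6*-0.20 + 0.4*-0.10
    "scenario_b": {"equities": -0.05, "rates": 0.02},   # 0.6*-0.05 + 0.4*0.02
}
flags = stress_flags(weights, scenarios, max_loss=0.10)
```

With these made-up inputs only `scenario_a` breaches the threshold, which is the kind of result the critic would surface as a stress flag.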

CuratorAgent

owns: the drafts

Compiles base/bull/bear weights (each set sums to 1 and passes the constraints) and decision-time fields. Reuses the hypothesis's statement and falsification conditions.
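A draft check along these lines could enforce the "sums to 1, passes the constraints" rule. The sketch is hypothetical: the text does not list the actual constraints, so the long-only rule and the per-asset cap `max_weight` are assumptions chosen for illustration, as is the function name `check_draft`.

```python
import math


def check_draft(weights: dict, max_weight: float = 1.0, tol: float = 1e-9) -> list:
    """Return a list of constraint violations; an empty list means the draft passes."""
    problems = []
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        problems.append(f"weights sum to {total:.6f}, expected 1")
    for asset, w in weights.items():
        if w < 0:
            # Assumed long-only constraint.
            problems.append(f"{asset}: negative weight {w}")
        elif w > max_weight:
            # Assumed per-asset cap.
            problems.append(f"{asset}: weight {w} above cap {max_weight}")
    return problems


# Illustrative drafts with made-up numbers.
drafts = {
    "base": {"equities": 0.6, "rates": 0.4},
    "bull": {"equities": 0.7, "rates": 0.3},
    "bear": {"equities": 0.5, "rates": 0.5},
}
report = {name: check_draft(w) for name, w in drafts.items()}
```

Returning the full list of violations, rather than failing on the first, gives the human reviewer the complete picture for each draft.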

Orchestrator

owns: provenance

Hashes a reproducible proposal_id, replays the audit log, writes the 5 artifacts, appends the leaderboard. Writes ONLY under research/proposals/.
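A reproducible identifier can be derived by hashing a canonical serialisation of the run's inputs. The sketch below assumes SHA-256 over sorted-key JSON of the config plus the code and data revisions; the actual hash function and the fields the Orchestrator hashes are not stated in this page, and the function signature is hypothetical. The 12-character default simply mirrors the length of the example id shown in the worked example.

```python
import hashlib
import json


def proposal_id(config: dict, code_rev: str, data_rev: str, length: int = 12) -> str:
    """Short hex digest of the run inputs; identical inputs give identical ids."""
    payload = json.dumps(
        {"config": config, "code": code_rev, "data": data_rev},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),     # no whitespace differences either
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


if __name__ == "__main__":
    # Hypothetical config keys; the revisions are the ones quoted in the worked example.
    print(proposal_id({"provider": "rulebased", "n_splits": 3}, "a0515a0", "376e75e"))
```

Sorting keys and fixing separators is what makes the id independent of dictionary ordering, so the same inputs always map to the same proposal directory.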

Invariant 1 · Human gate

The pipeline writes ONLY under research/proposals/ — it never creates or edits anything under data/. A test asserts `git status data/` is unchanged across a run. A human reviews the drafts and, if accepted, manually copies weights into the live book.
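The `git status data/` invariant could be checked roughly like this. It is a sketch, not the repo's test: the helper names are hypothetical, and the `run` parameter exists only so the check can be exercised without a real repository. `git status --porcelain` is used because its output is stable and empty when nothing has changed.

```python
import subprocess
from typing import Callable


def data_dir_status(repo_root: str, run: Callable = subprocess.run) -> str:
    """Return `git status --porcelain -- data/`; empty output means no changes."""
    result = run(
        ["git", "status", "--porcelain", "--", "data/"],
        cwd=repo_root, capture_output=True, text=True, check=True,
    )
    return result.stdout


def assert_data_untouched(repo_root: str, pipeline: Callable[[], None],
                          run: Callable = subprocess.run) -> None:
    """Run `pipeline` and fail if the status of data/ differs afterwards."""
    before = data_dir_status(repo_root, run)
    pipeline()
    after = data_dir_status(repo_root, run)
    assert before == after, f"pipeline modified data/:\n{after}"
```

Comparing the status before and after, instead of requiring an empty status, keeps the check valid even when a developer already has local edits under data/.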

Invariant 2 · Offline

No network is required. The default provider is deterministic and rule-based; the Anthropic SDK is imported only inside the Claude provider when a key is present. CI runs the deterministic path, so a proposal_id is reproducible.
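The "import only when a key is present" pattern might look like the sketch below. The class names, the `complete` method and the use of the `ANTHROPIC_API_KEY` environment variable are assumptions for illustration; only the deferred import of the Anthropic SDK inside the Claude provider reflects the description above.

```python
import os


class RuleBasedProvider:
    """Deterministic default; needs no network and no third-party package."""

    name = "rulebased"

    def complete(self, prompt: str) -> str:
        # Placeholder rule: a real provider would apply fixed templates here.
        return f"[rule-based] {prompt}"


class ClaudeProvider:
    """Optional provider; the SDK import is deferred to construction time."""

    name = "claude"

    def __init__(self, api_key: str) -> None:
        import anthropic  # imported only when a key is configured
        self._client = anthropic.Anthropic(api_key=api_key)


def select_provider(env=None):
    """Pick the Claude provider only if a non-empty key is present."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY")
    return ClaudeProvider(key) if key else RuleBasedProvider()
```

Because the import sits inside `ClaudeProvider.__init__`, a machine without the SDK installed (such as a CI runner on the deterministic path) never executes it.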

Worked example (latest proposal)

proposal_id c4e3bf45fbb2 · provider rulebased · grid/seed · deterministic=true · code a0515a0 · data 376e75e

Regime → Hypothesis

Regime quadrant Q4 (growth momentum -0.519, inflation momentum +0.275). Overweight tilt orientation: commodities, rates. Underweight tilt orientation: equities. Coarse-class views are aggregated from the fine tactical_matrix (OW=+1/N=0/UW=-1, mean per coarse class); the sign sets the search bound orientation, the magnitude is searched.
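The aggregation described above (score each fine cell, take the mean per coarse class, keep the sign as the orientation) can be sketched in a few lines. The layout of `tactical_matrix` in data/regime.yaml is not shown here, so the nested-dict shape, the fine bucket names and the function names are assumptions; the labels below are made up so that the result matches the orientations reported for this run.

```python
from statistics import fmean

SCORE = {"OW": 1, "N": 0, "UW": -1}


def coarse_views(tactical_matrix: dict) -> dict:
    """Mean score per coarse class; input maps coarse class -> {fine bucket: label}."""
    return {cls: fmean(SCORE[label] for label in fine.values())
            for cls, fine in tactical_matrix.items()}


def orientation(score: float) -> str:
    """Only the sign is kept: it fixes the search bound, the magnitude is searched."""
    return "OW" if score > 0 else "UW" if score < 0 else "N"


# Hypothetical fine buckets and labels, for illustration only.
matrix = {
    "commodities": {"cm_a": "OW"},
    "credit": {"cr_a": "OW", "cr_b": "UW"},
    "equities": {"eq_a": "UW", "eq_b": "UW", "eq_c": "N"},
    "rates": {"rt_a": "OW", "rt_b": "N"},
}
views = coarse_views(matrix)
tilts = {cls: orientation(score) for cls, score in views.items()}
```

Note that opposing fine views cancel: the two hypothetical credit cells average to zero, which is how a coarse class ends up neutral.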

commodities: OW · credit: N · equities: UW · rates: OW

live signal factors: defensive, momentum

Finalist

base_allocator: 60_40
tilt_strength: 0
Replay Sharpe (multi-window): 0.9339
on: 85 obs · 3 splits · 162 trials

Critic verdict (red team)

Deflated Sharpe: 0.3361 (SR0 1.67)
accept: false
stress basis: finalist_asset_shock_resim
stress flags: inf2022

In this run the critic re-simulated all 5 historical scenarios against the finalist’s own weights and flagged inf2022; the strict DSR gate returned accept=false. That is the system working as intended — on thin, single-regime data it is supposed to withhold approval. The drafts still wait at the human gate.

⚠ What "Replay Sharpe" means here (honest limitation)

The search ranks static weight vectors, marked to market across the test windows. Signal z-scores are recomputed per fold using only data before each window, so look-ahead is removed. Two gaps remain: nothing beyond the signals is refit per fold, and transaction costs are not yet modeled (a single rebalance implies zero turnover). This is therefore a multi-window replay with train-only signals, NOT a validated expanding walk-forward. Treat every number here as an in_frame_pass / R0-R1 framework illustration.

The example above was regenerated under this train-only code with a dimension-corrected Deflated Sharpe (annual SR converted to per-period, SR0 taken from the empirical cross-trial variance). On this thin single-regime data DSR ≈ 0.34, well below the 0.95 significance bar, so the critic REJECTS its own candidate: the system withholds approval by design. The remaining fix (transaction costs + multi-rebalance) is tracked in docs/09 §0.2.
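For readers who want to see the gate's arithmetic, here is a sketch of the standard Deflated Sharpe Ratio (Bailey and López de Prado) in per-period units. It is not the repo's implementation: the function names are hypothetical, and the defaults of zero skewness and kurtosis 3 (normal returns) are assumptions.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_STD_NORMAL = NormalDist()


def expected_max_sharpe(sr_std: float, n_trials: int) -> float:
    """SR0: expected best Sharpe among n_trials skill-less trials.

    sr_std is the standard deviation of per-period Sharpe ratios across trials.
    """
    if n_trials < 2:
        raise ValueError("need at least 2 trials")
    return sr_std * (
        (1 - EULER_GAMMA) * _STD_NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _STD_NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    )


def deflated_sharpe(sr: float, sr0: float, n_obs: int,
                    skew: float = 0.0, kurt: float = 3.0) -> float:
    """Probability that the true per-period Sharpe exceeds the benchmark sr0."""
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _STD_NORMAL.cdf((sr - sr0) * sqrt(n_obs - 1) / denom)


if __name__ == "__main__":
    periods = 252  # assumed annualisation factor (daily data)
    dsr = deflated_sharpe(0.9339 / sqrt(periods), 1.67 / sqrt(periods), n_obs=85)
    print(f"DSR ~ {dsr:.3f}; accept = {dsr >= 0.95}")
```

Feeding in the worked example's rounded figures (replay Sharpe 0.9339, SR0 1.67, 85 observations) with an assumed 252 periods per year and normal returns gives a value close to 0.34, in line with the reported 0.3361. A small gap is expected because the skewness, kurtosis and exact annualisation used by the real run are not stated here.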

Audit trail · what the orchestrator actually ran

Every provider call the orchestrator drove for this proposal, in order, replayed from audit.jsonl. model is empty because this run used the deterministic rule-based provider — the Claude provider is only imported when a key is configured. RegimeAgent (a pure read) and the Orchestrator (writes artifacts only) make no provider call, so they do not appear here.

%%{init: {"theme":"base","themeVariables":{"fontFamily":"ui-sans-serif, system-ui, -apple-system, Segoe UI, sans-serif","fontSize":"16px","background":"#faf9f5","actorBkg":"#efede4","actorBorder":"#a39e8f","actorTextColor":"#141413","actorLineColor":"#c4c0b2","signalColor":"#8a8576","signalTextColor":"#3c382f","noteBkg":"#dbe8f4","noteBorderColor":"#6a9bcc","noteTextColor":"#234a68","activationBkgColor":"#f6ddd0","activationBorderColor":"#d97757","sequenceNumberColor":"#faf9f5","labelBoxBkgColor":"#e1e8d3","labelBoxBorderColor":"#788c5d","labelTextColor":"#3c4a28","loopTextColor":"#3c4a28"},"sequence":{"useMaxWidth":false,"actorMargin":60,"boxMargin":14,"noteMargin":12,"messageMargin":42,"mirrorActors":true}}}%%
sequenceDiagram
  autonumber
  participant O as Orchestrator
  participant R as RegimeAgent
  participant S as Signals
  participant H as HypothesisAgent
  participant G as Generator+Search
  participant C as CriticAgent
  participant U as CuratorAgent
  participant L as audit.jsonl
  participant Hum as Human
  Note over O,L: offline · deterministic by default · provider=rulebased · model=none
  O->>R: read regime to coarse views
  R-->>O: views (no provider call, not logged)
  O->>S: compute factor z-scores
  S-->>O: momentum / defensive
  O->>H: state hypothesis + falsification
  H-->>L: log regime_summary, falsification
  H-->>O: thesis + 4 to 6 falsifiers
  O->>G: bounded multi-window replay search
  G-->>L: log search_space
  G-->>O: finalist params
  O->>C: critique (DSR + stress re-sim)
  C-->>L: log critique → accept=false, flag inf2022
  C-->>O: verdict
  O->>U: compile drafts
  U-->>L: log rationale
  U-->>O: base / bull / bear drafts
  O->>O: hash proposal_id · write 5 artifacts
  O->>Hum: hand off drafts + verdict
  Note over Hum: machine NEVER auto-accepts
  alt human approves
    Hum->>Hum: manually copy weights → data/
  else human declines
    Hum->>Hum: archive · zero deploy
  end

Figure B · The same run as a sequence trace (scroll horizontally to follow all nine lanes). Blue note = run mode; the orange activation bar marks the critic; only the four agents that hit a provider write to audit.jsonl; the run ends at the human gate (approve / decline).

The logged calls

01 · hypothesis · regime_summary · provider=rulebased · model=(empty)
state the macro hypothesis from the regime view

02 · hypothesis · falsification · provider=rulebased · model=(empty)
falsification conditions for the hypothesis

03 · generator · search_space · provider=rulebased · model=(empty)
generate vN.2 search_spec for the current regime

04 · critic · critique · provider=rulebased · model=(empty)
critique the finalist from DSR + stress context

05 · curator · rationale · provider=rulebased · model=(empty)
decision rationale prose
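An append-only JSON Lines log is enough to support this kind of replay. The sketch below is an assumption about how such a trail could be written and read back: the record fields (`agent`, `task`, `provider`, `model`) mirror the columns listed above, but the real audit.jsonl schema is not shown here and the function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def log_call(path: Path, agent: str, task: str, provider: str,
             model: Optional[str] = None) -> None:
    """Append one provider call as a single JSON line."""
    record = {"agent": agent, "task": task, "provider": provider, "model": model}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def replay(path: Path) -> list:
    """Read the calls back in the order they were logged."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo in a temporary directory so nothing is written to the repo.
with tempfile.TemporaryDirectory() as tmp:
    audit = Path(tmp) / "audit.jsonl"
    log_call(audit, "hypothesis", "regime_summary", "rulebased")
    log_call(audit, "hypothesis", "falsification", "rulebased")
    log_call(audit, "critic", "critique", "rulebased")
    calls = replay(audit)
```

Opening the file in append mode for every call means a crash mid-run still leaves every completed call on disk, one per line.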

Honesty

The committed price history is thin and single-regime (~120 trading days, one Q4 macro regime). There is no bull/bear transition in-sample, so the regime tilt cannot be validated out-of-regime.
Stress shocks are window-magnitude estimates (from each scenario’s benchmark line), not per-ETF actuals — framework validation only.
These proposals are illustrative, NOT robust. Do not deploy on this evidence alone — which is exactly why everything stops at a human gate.

Source: research/agents/* (orchestrator + agents), research/engine/* (vN.1 engine + signals), research/search/* (vN.2 search). Every proposal ships 5 artifacts — proposal.md, rebalance_draft.yaml, decision_draft.yaml, audit.jsonl, config.yaml — committed to the public repo with a reproducible proposal_id.