diff --git a/docs/api/index.md b/docs/api/index.md index 57f36bb7..cc423932 100644 --- a/docs/api/index.md +++ b/docs/api/index.md @@ -2648,7 +2648,7 @@ Test seam — inject the worktree-dirty check (defaults to `git status`). ### ImproveOptions -Defined in: [improvement/improve.ts:48](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L48) +Defined in: [improvement/improve.ts:47](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L47) #### Type Parameters @@ -2666,7 +2666,7 @@ Defined in: [improvement/improve.ts:48](https://github.com/tangle-network/agent- > `optional` **surface?**: [`ImproveSurface`](#improvesurface) -Defined in: [improvement/improve.ts:51](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L51) +Defined in: [improvement/improve.ts:50](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L50) Which profile lever to optimize. Default `'prompt'`. Selects the default generator + the baseline-surface extraction shape. @@ -2675,7 +2675,7 @@ Which profile lever to optimize. Default `'prompt'`. Selects the default > `optional` **generator?**: `SurfaceProposer`\<`unknown`\> -Defined in: [improvement/improve.ts:55](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L55) +Defined in: [improvement/improve.ts:54](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L54) The `SurfaceProposer` that mutates the surface. When unset, the facade picks the default for `surface` (`gepaProposer` for prompt, `skillOptProposer` @@ -2685,7 +2685,7 @@ The `SurfaceProposer` that mutates the surface. 
When unset, the facade > `optional` **gate?**: `"none"` \| `"holdout"` -Defined in: [improvement/improve.ts:58](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L58) +Defined in: [improvement/improve.ts:57](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L57) Gate mode. `'holdout'` (default) runs the held-out promotion gate; `'none'` is a baseline-only run (`budget.generations = 0`). @@ -2694,7 +2694,7 @@ Gate mode. `'holdout'` (default) runs the held-out promotion gate; > **scenarios**: `TScenario`[] -Defined in: [improvement/improve.ts:60](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L60) +Defined in: [improvement/improve.ts:59](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L59) Scenarios to evaluate against. Passthrough to `selfImprove`. @@ -2702,7 +2702,7 @@ Scenarios to evaluate against. Passthrough to `selfImprove`. > **judge**: `JudgeConfig`\<`TArtifact`, `TScenario`\> -Defined in: [improvement/improve.ts:62](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L62) +Defined in: [improvement/improve.ts:61](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L61) Judge that scores artifacts. Passthrough to `selfImprove`. @@ -2710,7 +2710,7 @@ Judge that scores artifacts. Passthrough to `selfImprove`. > **agent**: (`surface`, `scenario`, `ctx`) => `Promise`\<`TArtifact`\> -Defined in: [improvement/improve.ts:65](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L65) +Defined in: [improvement/improve.ts:64](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L64) The agent under improvement — same shape as `selfImprove.agent`: it takes the current surface + scenario + ctx and returns the artifact to judge. 
@@ -2737,7 +2737,7 @@ The agent under improvement — same shape as `selfImprove.agent`: it takes > `optional` **budget?**: `SelfImproveBudget` -Defined in: [improvement/improve.ts:67](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L67) +Defined in: [improvement/improve.ts:66](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L66) Budget + loop shape. Passthrough; `gate: 'none'` forces `generations = 0`. @@ -2745,7 +2745,7 @@ Budget + loop shape. Passthrough; `gate: 'none'` forces `generations = 0`. > `optional` **llm?**: `SelfImproveLlm` -Defined in: [improvement/improve.ts:70](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L70) +Defined in: [improvement/improve.ts:69](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L69) LLM config. Passthrough to `selfImprove` AND used to construct the default reflective proposer (`gepaProposer`/`skillOptProposer`) when `generator` is unset. @@ -2754,7 +2754,7 @@ LLM config. Passthrough to `selfImprove` AND used to construct the default > `optional` **allowedModels?**: readonly `string`[] -Defined in: [improvement/improve.ts:74](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L74) +Defined in: [improvement/improve.ts:73](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L73) Restrict the run to this subset of models. When set, the reflection model (`llm.model`, or the default when unset) must be a member, or `improve()` throws @@ -2764,7 +2764,7 @@ Restrict the run to this subset of models. 
When set, the reflection model ### ImproveResult -Defined in: [improvement/improve.ts:77](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L77) +Defined in: [improvement/improve.ts:76](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L76) #### Type Parameters @@ -2782,7 +2782,7 @@ Defined in: [improvement/improve.ts:77](https://github.com/tangle-network/agent- > **profile**: `AgentProfile` -Defined in: [improvement/improve.ts:80](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L80) +Defined in: [improvement/improve.ts:79](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L79) The profile after improvement: the winner surface applied back into the matching field when the gate shipped, else the input profile unchanged. @@ -2791,7 +2791,7 @@ The profile after improvement: the winner surface applied back into the > **shipped**: `boolean` -Defined in: [improvement/improve.ts:82](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L82) +Defined in: [improvement/improve.ts:81](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L81) True when `gateDecision === 'ship'`. @@ -2799,7 +2799,7 @@ True when `gateDecision === 'ship'`. > **lift**: `number` -Defined in: [improvement/improve.ts:84](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L84) +Defined in: [improvement/improve.ts:83](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L83) Held-out lift (`winner − baseline` composite). @@ -2807,7 +2807,7 @@ Held-out lift (`winner − baseline` composite). 
> **gateDecision**: `"ship"` \| `"hold"` \| `"need_more_work"` \| `"model_ceiling"` \| `"arch_ceiling"` -Defined in: [improvement/improve.ts:86](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L86) +Defined in: [improvement/improve.ts:85](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L85) The five-valued gate verdict from `selfImprove`. @@ -2815,7 +2815,7 @@ The five-valued gate verdict from `selfImprove`. > **raw**: `SelfImproveResult`\<`TScenario`, `TArtifact`\> -Defined in: [improvement/improve.ts:88](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L88) +Defined in: [improvement/improve.ts:87](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L87) Full `selfImprove` result for advanced inspection. @@ -6380,7 +6380,7 @@ Verifies the edited worktree. Sync or async; throws only on a setup fault > **ImproveSurface** = `"prompt"` \| `"skills"` \| `"tools"` \| `"mcp"` \| `"hooks"` \| `"code"` -Defined in: [improvement/improve.ts:46](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L46) +Defined in: [improvement/improve.ts:45](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L45) The agent-profile lever `improve` optimizes. Mirrors the AgentProfile-law profile levers; `code` is the implementation-tier surface. 
@@ -7774,7 +7774,7 @@ Defined in: [improvement/build-prompts.ts:43](https://github.com/tangle-network/ > **improve**\<`TScenario`, `TArtifact`\>(`profile`, `findings`, `opts`): `Promise`\<[`ImproveResult`](#improveresult)\<`TScenario`, `TArtifact`\>\> -Defined in: [improvement/improve.ts:200](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L200) +Defined in: [improvement/improve.ts:199](https://github.com/tangle-network/agent-runtime/blob/main/src/improvement/improve.ts#L199) Run the held-out-gated self-improvement loop on ONE profile surface. diff --git a/docs/api/lifecycle.md b/docs/api/lifecycle.md index b90f32e4..246365c4 100644 --- a/docs/api/lifecycle.md +++ b/docs/api/lifecycle.md @@ -1376,7 +1376,7 @@ The shared baseline eval (the "without" arm, measured once). ### SkillDraft -Defined in: [lifecycle/skill-generator.ts:32](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L32) +Defined in: [lifecycle/skill-generator.ts:33](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L33) A distilled skill draft: a name + the `SKILL.md` body. @@ -1386,7 +1386,7 @@ A distilled skill draft: a name + the `SKILL.md` body. > **name**: `string` -Defined in: [lifecycle/skill-generator.ts:34](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L34) +Defined in: [lifecycle/skill-generator.ts:35](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L35) Skill name — becomes the inline resource ref name + the artifact name. @@ -1394,7 +1394,7 @@ Skill name — becomes the inline resource ref name + the artifact name. 
> **content**: `string` -Defined in: [lifecycle/skill-generator.ts:36](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L36) +Defined in: [lifecycle/skill-generator.ts:37](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L37) The `SKILL.md` document body (markdown). @@ -1402,7 +1402,7 @@ The `SKILL.md` document body (markdown). > `optional` **description?**: `string` -Defined in: [lifecycle/skill-generator.ts:38](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L38) +Defined in: [lifecycle/skill-generator.ts:39](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L39) Optional one-line description for review surfaces. @@ -1410,7 +1410,7 @@ Optional one-line description for review surfaces. ### SkillGeneratorOptions -Defined in: [lifecycle/skill-generator.ts:56](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L56) +Defined in: [lifecycle/skill-generator.ts:57](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L57) #### Properties @@ -1418,7 +1418,7 @@ Defined in: [lifecycle/skill-generator.ts:56](https://github.com/tangle-network/ > **distill**: [`DistillSkills`](#distillskills) -Defined in: [lifecycle/skill-generator.ts:58](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L58) +Defined in: [lifecycle/skill-generator.ts:59](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L59) REQUIRED — the create step. Without it there is no skill to optimize. @@ -1426,7 +1426,7 @@ REQUIRED — the create step. Without it there is no skill to optimize. 
> `optional` **refine?**: [`RefineSkill`](#refineskill) -Defined in: [lifecycle/skill-generator.ts:60](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L60) +Defined in: [lifecycle/skill-generator.ts:61](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L61) OPTIONAL — the optimize step. Omit to ship distilled drafts unrefined. @@ -1917,7 +1917,7 @@ test injects a pure function. Returns up to `count` drafts. > **DistillSkills** = (`ctx`) => `Promise`\<[`SkillDraft`](#skilldraft)[]\> \| [`SkillDraft`](#skilldraft)[] -Defined in: [lifecycle/skill-generator.ts:47](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L47) +Defined in: [lifecycle/skill-generator.ts:48](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L48) DISTILL — create new skill drafts from the agent's history. Returns zero or more drafts (zero is valid: nothing worth distilling this round). The @@ -1940,11 +1940,11 @@ LLM; a test injects a pure function. > **RefineSkill** = (`draft`) => `Promise`\<[`SkillDraft`](#skilldraft)\> \| [`SkillDraft`](#skilldraft) -Defined in: [lifecycle/skill-generator.ts:54](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L54) +Defined in: [lifecycle/skill-generator.ts:55](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L55) REFINE — improve ONE distilled draft (wording, structure, examples). The -production implementation wraps `runSkillOpt`. Returns the refined draft; when -omitted from `skillGenerator`, the distilled draft is used as-is. +production implementation drives `skillOptProposer`. Returns the refined draft; +when omitted from `skillGenerator`, the distilled draft is used as-is. 
#### Parameters @@ -2540,7 +2540,7 @@ Cold start on a fixture domain (the closed loop in one call): > **skillGenerator**(`opts`): [`CandidateGenerator`](#candidategenerator)\<`"skill"`\> -Defined in: [lifecycle/skill-generator.ts:74](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L74) +Defined in: [lifecycle/skill-generator.ts:75](https://github.com/tangle-network/agent-runtime/blob/main/src/lifecycle/skill-generator.ts#L75) Build a `CandidateGenerator` for the skill surface that distills new skills from history, then (optionally) refines them, and emits each as a `skill` @@ -2559,10 +2559,10 @@ artifact carrying an inline `SKILL.md` resource ref. #### Example ```ts -Production wiring (distill = LLM reflection, refine = skillOpt): +Production wiring (distill = LLM reflection, refine = skillOptProposer): skillGenerator({ distill: reflectiveDistill, // creates the draft from traces - refine: skillOptRefine, // optimizes the draft + refine: skillOptRefine, // optimizes the draft via skillOptProposer }) ``` diff --git a/docs/canonical-api.md b/docs/canonical-api.md index fbd728bc..0d9779df 100644 --- a/docs/canonical-api.md +++ b/docs/canonical-api.md @@ -2,7 +2,7 @@ -> **Version 0.74.0.** Per-symbol signatures live in the generated `docs/api/` reference (one page per module). The pinned substrate is agent-eval `>=0.95.0 <1.0.0`; the sandbox substrate that materializes profiles into harness shapes is `@tangle-network/sandbox` (peer `>=0.8.0 <1.0.0`). The neutral contract types (`AgentProfile`, `AgentProfileMcpServer`, `HarnessType`, `ReasoningEffort`, `Part`/`ToolPart`/`ToolState`) are owned by **`@tangle-network/agent-interface`** (peer `>=0.10.0 <1.0.0`) — the single source of truth. Substrate symbols (`selfImprove`/`gepaProposer`/`defaultProductionGate`/`heldOutGate`/`pairedBootstrap`/…) are re-exported through `@tangle-network/agent-eval/contract` (or `/campaign`), not local to this package. 
+> **Version 0.75.0.** Per-symbol signatures live in the generated `docs/api/` reference (one page per module). The pinned substrate is agent-eval `>=0.97.0 <1.0.0`; the sandbox substrate that materializes profiles into harness shapes is `@tangle-network/sandbox` (peer `>=0.8.0 <1.0.0`). The neutral contract types (`AgentProfile`, `AgentProfileMcpServer`, `HarnessType`, `ReasoningEffort`, `Part`/`ToolPart`/`ToolState`) are owned by **`@tangle-network/agent-interface`** (peer `>=0.10.0 <1.0.0`) — the single source of truth. Substrate symbols (`selfImprove`/`gepaProposer`/`defaultProductionGate`/`heldOutGate`/`pairedBootstrap`/…) are re-exported through `@tangle-network/agent-eval/contract` (or `/campaign`), not local to this package. > > **`./loops` is the runtime barrel** — `package.json` maps it to `src/runtime/index.ts`. Everything below labelled `/loops` is the recursive-atom + loop-kernel surface. > @@ -55,7 +55,7 @@ Every symbol below is a LOCAL export of this package (subpath shown) unless tagg | Run a sandbox coding rollout, round-synchronous (fresh box per round) | `runLoop(options)` — `/loops` | a `new Sandbox()`+acquire+stream+parse+delete loop, or a 2nd winner-selector | | Run + **resume** ONE persistent box across turns | `openSandboxRun(client, opts, deliverable)` — `/loops` | a per-domain `new Sandbox`+`box.fs.read`+delete copy | | Pick / register a leaf backend, or bring your own agent | `createExecutor({ backend })` / `createExecutorRegistry()` / implement `Executor` — `/loops` | a per-vendor adapter or closed `inline\|sandbox\|cli` switch (won't report through the `UsageEvent` channel) | -| Evolve a **prompt/string** surface | `gepaProposer({ llm, model, target })` (default inside `selfImprove`) — `agent-eval/contract` | a hand-rolled prompt-mutation reflection loop with its own Pareto bookkeeping | +| Evolve a **prompt/string** surface | `gepaProposer({ llm, model, target })` (default inside `selfImprove`; the skill-surface twin is 
`skillOptProposer`, same source) — `agent-eval/campaign` | a hand-rolled prompt-mutation reflection loop with its own Pareto bookkeeping | | Self-improve a profile (one pluggable verb) — START HERE | `improve(profile, findings, { surface, gate })` — root `.` (the RSI verb; defaults the generator from `surface`, wraps `selfImprove`) | a bespoke optimize loop, or calling `selfImprove`/a skill-optimizer directly for the common case | | Measure **one profile artifact's marginal lift** (with-vs-without, score+cost) / catalog artifacts | `measureMarginalLift(...)` / `ArtifactRegistry` (`applyArtifact` is the one `ArtifactKind`→`AgentProfile`-field bridge) — `/lifecycle` | a hand-rolled with/without ablation loop, or a per-kind `if kind==='skill'…` profile-field switch | | Run the **whole artifact lifecycle** — generate→measure→promote→store→compose, then drift-watch/dedupe the live set — over ANY profile surface (skill/prompt/tool/MCP) | `runLifecycle({ baseline, generators, evalRunner, gate })` then `composeProfile(registry, base, query)`; maintain with `driftWatch(...)` / `dedupeArtifacts(...)` — `/lifecycle` | a per-surface improve loop, a hand-rolled promote→compose step, or re-running `measureMarginalLift` without the registry/gate spine. The ONLY per-surface code is a thin `CandidateGenerator` (`skillGenerator` distills, `promptGenerator`/`buildableGenerator` for the rest) | diff --git a/package.json b/package.json index 1cf890c7..ace311d7 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "@tangle-network/agent-runtime", - "version": "0.74.0", + "version": "0.75.0", "description": "Shared task-lifecycle skeleton for agents: a recursive loop kernel for chat turns, one-shot tasks, and multi-attempt loops, with trace capture and eval-gated self-improvement. 
Domain behavior lives in adapters; scoring and ship-gates in @tangle-network/agent-eval.", "homepage": "https://github.com/tangle-network/agent-runtime#readme", "repository": { @@ -89,7 +89,7 @@ }, "devDependencies": { "@biomejs/biome": "^2.4.15", - "@tangle-network/agent-eval": ">=0.95.0 <1.0.0", + "@tangle-network/agent-eval": ">=0.97.0 <1.0.0", "@tangle-network/agent-interface": ">=0.10.0 <1.0.0", "@tangle-network/sandbox": ">=0.8.0 <1.0.0", "@types/node": "^25.9.3", @@ -117,7 +117,7 @@ "license": "MIT", "packageManager": "pnpm@10.28.0", "peerDependencies": { - "@tangle-network/agent-eval": ">=0.95.0 <1.0.0", + "@tangle-network/agent-eval": ">=0.97.0 <1.0.0", "@tangle-network/agent-interface": ">=0.10.0 <1.0.0", "@tangle-network/agent-knowledge": ">=1.7.0 <2.0.0", "@tangle-network/sandbox": ">=0.8.0 <1.0.0", diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index e148ffe3..12ca5b8c 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -16,8 +16,8 @@ importers: specifier: ^2.4.15 version: 2.4.15 '@tangle-network/agent-eval': - specifier: '>=0.95.0 <1.0.0' - version: 0.95.0(typescript@5.9.3) + specifier: '>=0.97.0 <1.0.0' + version: 0.97.0(typescript@5.9.3) '@tangle-network/agent-interface': specifier: '>=0.10.0 <1.0.0' version: 0.10.0 @@ -488,8 +488,8 @@ packages: '@tangle-network/sandbox': optional: true - '@tangle-network/agent-eval@0.95.0': - resolution: {integrity: sha512-qb4ntvDIMj1yPISP2g6zO0qzxYNQINdZLQbxZh4id115nMzUNVij3xHkaeO8EvcFvzqrMJ2s+gK1sbf9gc8Dyw==} + '@tangle-network/agent-eval@0.97.0': + resolution: {integrity: sha512-SCC2QxNgTqrHK0+WNTQIvuZtfcGdSi/ejf7c1x5yGYIS/iM7nYxBNXsr9i64qMcsvxyHB1ecf3ZOJZiw8WpMfQ==} engines: {node: '>=20'} hasBin: true @@ -1423,7 +1423,7 @@ snapshots: - typescript - utf-8-validate - '@tangle-network/agent-eval@0.95.0(typescript@5.9.3)': + '@tangle-network/agent-eval@0.97.0(typescript@5.9.3)': dependencies: '@asteasolutions/zod-to-openapi': 8.5.0(zod@4.4.3) '@ax-llm/ax': 19.0.45(zod@4.4.3) diff --git 
a/src/improvement/improve.test.ts b/src/improvement/improve.test.ts new file mode 100644 index 00000000..15f2bd00 --- /dev/null +++ b/src/improvement/improve.test.ts @@ -0,0 +1,109 @@ +/** + * `improve()` default-proposer resolution proof. + * + * The regression this guards: `improve()` maps each surface to a default + * `SurfaceProposer` — `prompt → gepaProposer`, `skills → skillOptProposer`. + * Both proposers are factories exported from `@tangle-network/agent-eval/campaign`. + * If either import resolves to `undefined` (a substrate export drift), the facade + * does not fail at module load — it fails at CALL time, the first time a caller + * names that surface. So a green typecheck is not enough; this test drives the + * REAL `improve()` far enough to construct the default proposer and run the + * baseline-only loop to a gate decision. + * + * It is deterministic and offline: `gate: 'none'` forces `generations = 0`, so + * `selfImprove` runs the baseline cells only and never calls the reflection LLM + * the proposer wraps. The stub agent reports a token-bearing cost through + * `ctx.cost` so the substrate's backend-integrity guard (default `'assert'`) + * sees a real backend rather than a silent-zero stub. + */ + +import type { DispatchContext, JudgeConfig, Scenario } from '@tangle-network/agent-eval/contract' +import type { AgentProfile } from '@tangle-network/agent-interface' +import { describe, expect, it } from 'vitest' +import { ConfigError } from '../errors' +import { type ImproveSurface, improve } from './improve' + +// Four scenarios so the train/holdout split is non-empty at the default 0.25 +// holdout fraction (a single scenario yields an empty train split). +const scenarios: Scenario[] = [ + { id: 'a', kind: 'fixture' }, + { id: 'b', kind: 'fixture' }, + { id: 'c', kind: 'fixture' }, + { id: 'd', kind: 'fixture' }, +] + +// A deterministic judge — every artifact scores the same. The POINT is the +// proposer wiring, not a score gradient. 
+const judge: JudgeConfig<{ text: string }, Scenario> = { + name: 'stub-judge', + dimensions: [{ key: 'q', description: 'fixture quality' }], + score: () => ({ dimensions: { q: 0.5 }, composite: 0.5, notes: '' }), +} + +// The agent reports a token-bearing cost so the backend-integrity guard treats +// it as a real backend. Without `ctx.cost.observeTokens`, the default +// `expectUsage: 'assert'` reads the cell as a silent-zero stub and throws. +async function stubAgent( + surface: unknown, + _scenario: Scenario, + ctx: DispatchContext, +): Promise<{ text: string }> { + ctx.cost.observe(0.0001, 'stub-agent') + ctx.cost.observeTokens({ input: 1, output: 1 }) + return { text: String(surface) } +} + +const promptProfile = (): AgentProfile => ({ + name: 'fixture-agent', + prompt: { systemPrompt: 'be careful' }, +}) + +const skillProfile = (): AgentProfile => ({ + name: 'fixture-agent', + resources: { skills: [] }, +}) + +describe('improve() — default proposer resolution (substrate export drift guard)', () => { + it("surface 'prompt' resolves gepaProposer and runs the baseline loop without crashing", async () => { + const result = await improve(promptProfile(), [], { + surface: 'prompt', + gate: 'none', + scenarios, + judge, + agent: stubAgent, + }) + + // The default gepaProposer was constructed (not undefined) and selfImprove + // ran to a gate decision; a baseline-only run holds. + expect(result.gateDecision).toBe('hold') + expect(result.shipped).toBe(false) + // Baseline-only: nothing shipped, so the profile is returned unchanged. 
+ expect(result.profile.prompt?.systemPrompt).toBe('be careful') + }) + + it("surface 'skills' resolves skillOptProposer and runs the baseline loop without crashing", async () => { + const result = await improve(skillProfile(), [], { + surface: 'skills', + gate: 'none', + scenarios, + judge, + agent: stubAgent, + }) + + expect(result.gateDecision).toBe('hold') + expect(result.shipped).toBe(false) + expect(result.profile.resources?.skills).toEqual([]) + }) + + it('a surface with no zero-config default still fails loud with ConfigError', async () => { + // The default-proposer map covers prompt + skills only; the config surfaces + // (tools/mcp/hooks/code) require a caller-supplied generator. This is the + // designed boundary the proposer migration must NOT erase. + const configSurfaces: ImproveSurface[] = ['tools', 'mcp', 'hooks', 'code'] + for (const surface of configSurfaces) { + await expect( + improve(promptProfile(), [], { surface, gate: 'none', scenarios, judge, agent: stubAgent }), + ).rejects.toBeInstanceOf(ConfigError) + } + }) +}) diff --git a/src/improvement/improve.ts b/src/improvement/improve.ts index 6d353b45..702119db 100644 --- a/src/improvement/improve.ts +++ b/src/improvement/improve.ts @@ -24,10 +24,9 @@ * straight through to `selfImprove`. 
*/ -import { skillOptProposer } from '@tangle-network/agent-eval/campaign' +import { gepaProposer, skillOptProposer } from '@tangle-network/agent-eval/campaign' import { type DispatchContext, - gepaProposer, type JudgeConfig, type MutableSurface, type Scenario, diff --git a/src/lifecycle/prompt-generator.ts b/src/lifecycle/prompt-generator.ts index b812a3c9..28e5fc4d 100644 --- a/src/lifecycle/prompt-generator.ts +++ b/src/lifecycle/prompt-generator.ts @@ -34,13 +34,13 @@ import { type AnalystFinding, callLlmJson, type LlmClientOptions } from '@tangle-network/agent-eval' import { type GenerationRecord, + gepaProposer, isProposedCandidate, type MutableSurface, type ProposeContext, type ProposedCandidate, type SurfaceProposer, } from '@tangle-network/agent-eval/campaign' -import { gepaProposer } from '@tangle-network/agent-eval/contract' import type { AgentProfile } from '@tangle-network/agent-interface' import type { CandidateGenerator, GenerateContext } from './generator' import type { ArtifactInput } from './types' diff --git a/src/lifecycle/skill-generator.ts b/src/lifecycle/skill-generator.ts index e40a7e05..ef48aa4d 100644 --- a/src/lifecycle/skill-generator.ts +++ b/src/lifecycle/skill-generator.ts @@ -13,9 +13,10 @@ * reflection over the trace produces the first draft. * * 2. REFINE — take the distilled draft and improve its wording/structure. The - * production `refine` wraps agent-eval's skill optimizer (`runSkillOpt`). - * Refinement is optional: with no `refine`, the distilled draft IS the - * candidate. + * production `refine` drives agent-eval's `skillOptProposer` (the uniform + * skill-surface proposer factory from `@tangle-network/agent-eval/campaign`, + * the same source as `gepaProposer` for the prompt surface). Refinement is + * optional: with no `refine`, the distilled draft IS the candidate. * * Both steps are INJECTED seams, not hardcoded engines — per the §1.5 law, the * generator AUTHORS a profile piece; it does not embed a specific LLM loop. 
That @@ -48,8 +49,8 @@ export type DistillSkills = (ctx: GenerateContext) => Promise<SkillDraft[]> | Sk /** * REFINE — improve ONE distilled draft (wording, structure, examples). The - * production implementation wraps `runSkillOpt`. Returns the refined draft; when - * omitted from `skillGenerator`, the distilled draft is used as-is. + * production implementation drives `skillOptProposer`. Returns the refined draft; + * when omitted from `skillGenerator`, the distilled draft is used as-is. */ export type RefineSkill = (draft: SkillDraft) => Promise<SkillDraft> | SkillDraft @@ -65,10 +66,10 @@ export interface SkillGeneratorOptions { * from history, then (optionally) refines them, and emits each as a `skill` * artifact carrying an inline `SKILL.md` resource ref. * - * @example Production wiring (distill = LLM reflection, refine = skillOpt): + * @example Production wiring (distill = LLM reflection, refine = skillOptProposer): * skillGenerator({ * distill: reflectiveDistill, // creates the draft from traces - * refine: skillOptRefine, // optimizes the draft + * refine: skillOptRefine, // optimizes the draft via skillOptProposer * }) */ export function skillGenerator(opts: SkillGeneratorOptions): CandidateGenerator<'skill'> {
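Reviewer note: the new `improve.test.ts` guards against a default proposer import resolving to `undefined`, which would surface at call time rather than at module load. The sketch below illustrates that failure mode in isolation. It is not the code in `improve.ts`: `resolveProposer`, `Proposer`, `ProposerFactory`, `healthy`, and `drifted` are hypothetical names, the local `ConfigError` is a stand-in for the one in `../errors`, and the ordering of the checks is an assumption.

```typescript
// Hypothetical stand-ins: none of these are the real agent-runtime or agent-eval types.
type ImproveSurface = 'prompt' | 'skills' | 'tools' | 'mcp' | 'hooks' | 'code'

interface Proposer {
  readonly name: string
}
type ProposerFactory = () => Proposer

// Local stand-in for the package's ConfigError.
class ConfigError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'ConfigError'
  }
}

// Assumed policy, mirroring the tests in the diff: only 'prompt' and 'skills'
// have a zero-config default; every other surface needs an explicit proposer.
function resolveProposer(
  surface: ImproveSurface,
  defaults: Partial<Record<ImproveSurface, ProposerFactory | undefined>>,
  explicit?: Proposer,
): Proposer {
  if (explicit) return explicit
  if (surface !== 'prompt' && surface !== 'skills') {
    throw new ConfigError(`surface '${surface}' has no default proposer; pass a generator`)
  }
  const factory = defaults[surface]
  if (typeof factory !== 'function') {
    // A drifted export is only noticed here, when a caller names the surface.
    throw new TypeError(`default proposer for '${surface}' resolved to ${typeof factory}`)
  }
  return factory()
}

// A healthy default map, and one where the skills export has drifted away.
const healthy = {
  prompt: (): Proposer => ({ name: 'gepa' }),
  skills: (): Proposer => ({ name: 'skillOpt' }),
}
const drifted = { ...healthy, skills: undefined }

console.log(resolveProposer('prompt', healthy).name) // prints "gepa"
console.log(resolveProposer('prompt', drifted).name) // prints "gepa": the drift stays hidden
```

In the sketch, the drifted map still type-checks and the `prompt` path still works, so the problem appears only once a caller asks for `skills`. That is the argument for driving each default surface through a baseline-only run, as the tests in this diff do.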