Implement spec 043: mur find + sample catalogue#451
Conversation
Scenario, ScenarioCatalogue, SearchResult records with System.Text.Json source-gen for AOT compat. Spec 043 Phase 1, work item 1. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
One scenario (use-state-basic) + README with authoring contract. Spec 043 Phase 1, item 5. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DataLoader reads embedded scenarios.json via manifest resource. Notes seeds 5 pitfall entries. Spec 043 Phase 1, work item 3. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Walks samples/scenarios/, validates JSON+CS, strips metadata headers, emits scenarios.json. Spec 043 Phase 1, work item 6. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two-layer BM25 scorer, stop words, synonym/phrase maps, SearchEngine with factory grouping. Spec 043 Phase 1, work item 2. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FindCommand uses SearchEngine for BM25 ranking. GetCommand shows scenario + notes + related. ListCommand groups by category. Program.cs dispatch. Spec 043 Phase 1, work item 4. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BM25, SearchEngine, Synonyms, Notes tests. Spec 043 Phase 1, work item 7. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Generate via SampleCatalogue extractor, embed in Reactor.Cli.csproj. Remove unnecessary PackageRef from extractor. Spec 043 Phase 1, wiring. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add ~90 phrase collapses and ~90 synonym entries covering all P0 factories, cross-framework terms, and abbreviations per spec 043 §4.7. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Cover all P0 factory anchors with 2-3 practical notes each per spec 043 §4.8. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
textblock-basic, heading-subhead-caption, body-bodystrong, rich-text-inlines, text-wrap-truncate, localized-text. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
vstack-basic, hstack-basic, flexrow-with-grow, flexcolumn-with-justify, grid-basic, grid-spans, border-with-corner, card-surface, scrollviewer-vertical, canvas-positioning. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
button-label-onclick, button-with-icon, button-with-command, hyperlink-button, togglebutton-basic, appbarbutton-in-commandbar. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
form-text-fields, form-validation-context, form-field-wrapper, form-submit-gating, form-async-submit, form-with-server-errors. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
list-basic-foreach, list-add-delete-toggle, list-with-empty-state, list-with-loading, master-detail, virtualized-large-list. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
textfield-twoway, numberbox-validated, checkbox-bool, toggleswitch, radiobuttons-group, combobox-from-list, combobox-of-elements, autosuggestbox-typeahead, calendardatepicker, slider-range. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
use-state-record, use-state-list-pitfall, use-reducer-list, use-reducer-typed, use-effect-mount, use-effect-deps, use-effect-cleanup, use-ref-dom, use-ref-mutable, use-memo, use-callback, use-context-basic, use-context-multi, use-reducer-with-context, custom-hook-pattern. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
calendar-multiselect, sidebar-nav, async-fetch-list, named-styles. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All 9 recipes ported to samples/scenarios/. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Point reactor-recipes skill at samples/scenarios/ instead of deleted references/*.cs files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
codemonkeychris
left a comment
There was a problem hiding this comment.
can you run this with the eval framework to see if this actually helps the results? part of this specific one is to iterate on the evals to see it improve. if you need help getting the evals going, I can point you at the repo (this is Nikola's benchmark)
| Console.WriteLine("```"); | ||
|
|
||
| var notes = Notes.GetNotes(scenario.NotesKey); | ||
| if (notes is { Length: > 0 } && scenario.NotesKey is not null) |
| var (debounced, setDebounced) = ctx.UseState(value); | ||
| ctx.UseEffect(() => | ||
| { | ||
| var cts = new CancellationTokenSource(); |
| foreach (Match match in TokenRegex().Matches(text.ToLowerInvariant())) | ||
| { | ||
| if (match.Value.Length > 0) | ||
| { | ||
| yield return match.Value; | ||
| } | ||
| } |
| catch (TaskCanceledException) | ||
| { | ||
| } |
|
@codemonkeychris — ran the eval framework as you suggested. Headline: the catalogue measurably improves results, and only after a skill prescribes SetupUsed
That's the clean A/B: identical model + scenario, only the skill prescription changed, +7 score points and observable behavior change. Broader sweep (4 scenarios ×
|
| Scenario | Score | Builds | Runs | mur find |
mur get |
|---|---|---|---|---|---|
| counter-winui | 70 | ✅ | ✅ | 4 | (incl. in get) |
| kanban | 71 | ✅ | ✅ | 7 | 9 |
| pomodoro | 69 | ✅ | ✅ | 5 | 6 |
| paint-app | 81 | ✅ | ✅ | 5 | 6 |
| markdown-editor-winui | 0 | ❌ | ❌ | 10 | 7 |
- paint-app: 81. Historical baseline run on the same scenario (run7, opus-4.6, old catalogue, no prompt) crashed at startup for 13. The catalogue let the agent ground itself before writing canvas code.
- markdown-editor-winui: 0. Timeout (25-min cap) — not a catalogue failure. The agent did consult the catalogue (10 finds, 7 gets) but didn't converge. Same scenario routinely times out under bare conditions too.
- n=5 trials, 1 iteration. Not stats-grade, but every successful trial used
mur find4–10 times and built+ran.
Gaps found while running this
A few things to fix in a follow-up (NOT this PR):
mur check --platform x64— agent invoked this and got "unknown flag". Worked around withdotnet build -p:Platform=x64. Need to accept that flag.- No persistence/
LocalSettingsscenario — agent searchedmur find "UsePersisted"and got nothing. Real coverage gap. - Benchmark's
reactor-murskill lives in a foreign repo — sync is currently manual. Probably wants an upstream PR towin-dev-skills-benchmarkafter this lands.
Evidence
Raw trial dirs (local):
D:\win-dev-skills-benchmark\agent-benchmark\results\run9\cw49_reactor-mur_s46_i1\— counter-winui +7 A/BD:\win-dev-skills-benchmark\agent-benchmark\results\run10\— 4-scenario sweepsession-logs-dir\build-events.jsonlin each contains the actualmur find "<query>"calls
Happy to run the full 22-scenario × baseline-vs-treatment matrix if you want a stronger statistical claim, but it'd cost ~3 premium reqs/trial × 44 trials. The A/B above plus the sweep table is the cleanest signal I can produce in a couple of hours.
|
Controlled A/B — you were right to push back. Re-ran with a proper control. SetupIdentical scenario (counter-winui), identical model (sonnet-4.6), identical skill (with the "find first" prompt). Only difference: which
n=3 trials per arm, run sequentially (treatment first, then swap dll, then control). Results
All 6 trials built and ran successfully. Behavioral evidence:
Honest takeCatalogue helps, but it's a smaller effect than my last post implied.
What we definitively know now:
Raw data: Recommendation: merge on the strength of (a) behavioral change being undeniable, (b) the floor-raising effect being directionally clear, and (c) the catalogue being mechanically sound. But don't oversell it as a dramatic improvement — it's a real but modest grounding effect. Bigger sample sizes and harder scenarios are followup work. |
Summary
Implements spec 043 (mur find / sample catalogue) in two phases:
Phase 1 — Tool skeleton
Phase 2 — P0 catalogue (~64 scenarios)
Validation
Closes #355