
Implement spec 043: mur find + sample catalogue #451

Open
sundaramramaswamy wants to merge 21 commits into main from feature/mur-find


@sundaramramaswamy
Collaborator

Summary

Implements spec 043 (mur find / sample catalogue) in two phases:

Phase 1 — Tool skeleton

  • CLI dispatch for mur find, mur get, mur list commands
  • BM25 search engine with factory grouping and field weighting
  • Synonym expansion (phrase collapse → tokenize → expand)
  • Notes (pitfall guidance per factory anchor)
  • Embedded resource pipeline (scenarios.json via build-time extractor)
  • AOT-compatible JSON via System.Text.Json source generation
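
For readers unfamiliar with field-weighted BM25, the following is a minimal sketch of the general technique, not the scorer added in this PR. The field names (title, body), the weights, the k1 and b defaults, and the three sample documents are all assumptions made for illustration; factory grouping and the two-layer design mentioned in the commits are not modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical mini-corpus: each document has pre-tokenized "title" and "body" fields.
var docs = new List<Dictionary<string, string[]>>
{
    new() { ["title"] = new[] { "counter", "state" }, ["body"] = new[] { "increment", "a", "counter", "with", "usestate" } },
    new() { ["title"] = new[] { "grid", "layout" },   ["body"] = new[] { "rows", "and", "columns" } },
    new() { ["title"] = new[] { "list", "state" },    ["body"] = new[] { "add", "and", "delete", "items" } },
};

// Assumed weights: a hit in the title counts three times as much as a hit in the body.
var weights = new Dictionary<string, double> { ["title"] = 3.0, ["body"] = 1.0 };

var ranked = docs
    .Select((d, i) => (Index: i, Score: Score(docs, d, new[] { "counter" }, weights)))
    .OrderByDescending(r => r.Score)
    .ToList();

foreach (var (index, score) in ranked)
    Console.WriteLine($"doc {index}: {score:F3}");

// Classic BM25 computed per field, then combined as a weighted sum over fields.
static double Score(
    IReadOnlyList<Dictionary<string, string[]>> corpus,
    Dictionary<string, string[]> doc,
    IEnumerable<string> queryTerms,
    IReadOnlyDictionary<string, double> fieldWeights,
    double k1 = 1.2,
    double b = 0.75)
{
    double total = 0;
    foreach (var (field, weight) in fieldWeights)
    {
        double avgLen = corpus.Average(d => d[field].Length);
        foreach (var term in queryTerms)
        {
            // Document frequency: how many documents contain the term in this field.
            int df = corpus.Count(d => d[field].Contains(term));
            if (df == 0) continue;

            double idf = Math.Log(1 + (corpus.Count - df + 0.5) / (df + 0.5));
            int tf = doc[field].Count(t => t == term);

            // Length normalisation: long fields are penalised relative to the average.
            double norm = tf + k1 * (1 - b + b * doc[field].Length / avgLen);
            total += weight * idf * tf * (k1 + 1) / norm;
        }
    }
    return total;
}
```

In this toy corpus only the first document mentions "counter", so it ranks first and the other two score zero.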

Phase 2 — P0 catalogue (~64 scenarios)

  • Hooks (16): useState, useReducer, useEffect, useRef, useMemo, useCallback, useContext, custom hooks
  • Layout (11): VStack, HStack, FlexRow, FlexColumn, Grid, Border, Card, ScrollView, Canvas, named styles
  • Text (6): TextBlock, type ramp, RichTextBlock, wrapping/truncation, localization
  • Buttons (6): Button, icon, command, hyperlink, toggle, CommandBar
  • Inputs (11): TextField, NumberBox, CheckBox, ToggleSwitch, RadioButtons, ComboBox, AutoSuggestBox, CalendarDatePicker, Slider, calendar-multiselect
  • Lists (6): ForEach, add/delete/toggle, empty state, loading, master-detail, virtualized
  • Forms (6): text fields, validation context, FormField wrapper, submit gating, async submit, server errors
  • Navigation (1): sidebar-nav with typed routing
  • Expanded synonym maps (~90 phrases, ~90 synonyms)
  • Expanded pitfall notes (28 entries)
  • Ported 9 legacy reactor-recipes → scenarios, deleted old files
  • Updated skill file references
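
The phrase and synonym maps feed the query pipeline described in Phase 1 (phrase collapse → tokenize → expand). A rough sketch of that ordering is shown here; the map entries, the token regex, and the Normalize helper are hypothetical stand-ins, and stop-word removal is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical entries; the real maps hold roughly 90 phrases and 90 synonyms.
var phrases = new Dictionary<string, string>
{
    ["drop down"] = "combobox",
    ["text box"] = "textfield",
};
var synonyms = new Dictionary<string, string[]>
{
    ["combobox"] = new[] { "dropdown", "select" },
};

Console.WriteLine(string.Join(" ", Normalize("Drop down with a text box", phrases, synonyms)));
// prints: combobox dropdown select with a textfield

static List<string> Normalize(
    string query,
    IReadOnlyDictionary<string, string> phraseMap,
    IReadOnlyDictionary<string, string[]> synonymMap)
{
    // 1. Phrase collapse: multi-word phrases become a single canonical token.
    //    Naive substring replacement; a real implementation would respect word boundaries.
    string text = query.ToLowerInvariant();
    foreach (var (phrase, replacement) in phraseMap)
        text = text.Replace(phrase, replacement);

    // 2. Tokenize: split into lowercase alphanumeric runs.
    var tokens = Regex.Matches(text, "[a-z0-9]+").Select(m => m.Value);

    // 3. Expand: append the synonyms of each token right after it.
    var expanded = new List<string>();
    foreach (var token in tokens)
    {
        expanded.Add(token);
        if (synonymMap.TryGetValue(token, out var extra))
            expanded.AddRange(extra);
    }
    return expanded.Distinct().ToList();
}
```

Collapsing phrases before tokenizing matters because "drop down" would otherwise split into two unrelated tokens that no synonym entry could match.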

Validation

  • 8321 unit tests pass, 0 failures
  • 64 scenarios extracted by catalogue tool
  • All scenario files follow the authoring contract (Scenario.cs + scenario.json)

Closes #355

sundaramramaswamy and others added 21 commits May 20, 2026 22:53
Scenario, ScenarioCatalogue, SearchResult records
with System.Text.Json source-gen for AOT compat.

Spec 043 Phase 1, work item 1.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
One scenario (use-state-basic) + README with
authoring contract. Spec 043 Phase 1, item 5.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DataLoader reads embedded scenarios.json via
manifest resource. Notes seeds 5 pitfall entries.

Spec 043 Phase 1, work item 3.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Walks samples/scenarios/, validates JSON+CS,
strips metadata headers, emits scenarios.json.

Spec 043 Phase 1, work item 6.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two-layer BM25 scorer, stop words, synonym/phrase
maps, SearchEngine with factory grouping.

Spec 043 Phase 1, work item 2.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FindCommand uses SearchEngine for BM25 ranking.
GetCommand shows scenario + notes + related.
ListCommand groups by category. Program.cs dispatch.

Spec 043 Phase 1, work item 4.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BM25, SearchEngine, Synonyms, Notes tests.
Spec 043 Phase 1, work item 7.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Generate via SampleCatalogue extractor, embed in
Reactor.Cli.csproj. Remove unnecessary PackageRef
from extractor.

Spec 043 Phase 1, wiring.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add ~90 phrase collapses and ~90 synonym entries
covering all P0 factories, cross-framework terms,
and abbreviations per spec 043 §4.7.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Cover all P0 factory anchors with 2-3 practical
notes each per spec 043 §4.8.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
textblock-basic, heading-subhead-caption,
body-bodystrong, rich-text-inlines,
text-wrap-truncate, localized-text.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
vstack-basic, hstack-basic, flexrow-with-grow,
flexcolumn-with-justify, grid-basic, grid-spans,
border-with-corner, card-surface,
scrollviewer-vertical, canvas-positioning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
button-label-onclick, button-with-icon,
button-with-command, hyperlink-button,
togglebutton-basic, appbarbutton-in-commandbar.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
form-text-fields, form-validation-context,
form-field-wrapper, form-submit-gating,
form-async-submit, form-with-server-errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
list-basic-foreach, list-add-delete-toggle,
list-with-empty-state, list-with-loading,
master-detail, virtualized-large-list.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
textfield-twoway, numberbox-validated,
checkbox-bool, toggleswitch, radiobuttons-group,
combobox-from-list, combobox-of-elements,
autosuggestbox-typeahead, calendardatepicker,
slider-range.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
use-state-record, use-state-list-pitfall,
use-reducer-list, use-reducer-typed,
use-effect-mount, use-effect-deps,
use-effect-cleanup, use-ref-dom, use-ref-mutable,
use-memo, use-callback, use-context-basic,
use-context-multi, use-reducer-with-context,
custom-hook-pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
calendar-multiselect, sidebar-nav,
async-fetch-list, named-styles.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All 9 recipes ported to samples/scenarios/.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Point reactor-recipes skill at samples/scenarios/
instead of deleted references/*.cs files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collaborator

@codemonkeychris codemonkeychris left a comment


can you run this with the eval framework to see if this actually helps the results? part of this specific one is to iterate on the evals to see it improve. if you need help getting the evals going, I can point you at the repo (this is Nikola's benchmark)

Console.WriteLine("```");

var notes = Notes.GetNotes(scenario.NotesKey);
if (notes is { Length: > 0 } && scenario.NotesKey is not null)

var (debounced, setDebounced) = ctx.UseState(value);
ctx.UseEffect(() =>
{
    var cts = new CancellationTokenSource();
Comment on lines +240 to +246
foreach (Match match in TokenRegex().Matches(text.ToLowerInvariant()))
{
    if (match.Value.Length > 0)
    {
        yield return match.Value;
    }
}
Comment on lines +47 to +49
catch (TaskCanceledException)
{
}
@sundaramramaswamy
Collaborator Author

@codemonkeychris — ran the eval framework as you suggested. Headline: the catalogue measurably improves results, and only after a skill prescribes mur find <intent> does the agent actually use it.

Setup

Used win-dev-skills-benchmark (Nikola's repo). Two probes were necessary, because the first one surfaced a sub-bug:

  1. Pilot A — counter-winui, reactor-mur agent, sonnet-4.6, catalogue present, no mur find prompt in the skill. Score 63/100. Build-events show zero mur find calls. The agent loaded the skill, saw mur check mentioned, never thought to query the catalogue.
  2. Fix the prompt. Added a short "Find a working scenario first — mur find <intent>" section to reactor-getting-started SKILL.md (commit 639278a4). Synced to the benchmark's reactor-mur skill.
  3. Pilot B — same scenario, same model, with the prompt. Score 70/100 (+7). Build-events show 4 actual mur find / mur get invocations.

That's the clean A/B: identical model + scenario, only the skill prescription changed, +7 score points and observable behavior change.

Broader sweep (4 scenarios × reactor-mur × sonnet-4.6, 1 iter)

Scenario               Score  Builds  Runs  mur find  mur get
counter-winui          70                   4         (incl. in get)
kanban                 71                   7         9
pomodoro               69                   5         6
paint-app              81                   5         6
markdown-editor-winui  0                    10        7
  • paint-app: 81. The historical baseline run on the same scenario (run7, opus-4.6, old catalogue, no prompt) crashed at startup and scored 13. The catalogue let the agent ground itself before writing canvas code.
  • markdown-editor-winui: 0. Timeout (25-min cap) — not a catalogue failure. The agent did consult the catalogue (10 finds, 7 gets) but didn't converge. Same scenario routinely times out under bare conditions too.
  • n=5 trials, 1 iteration. Not stats-grade, but every successful trial used mur find 4–10 times and built+ran.

Gaps found while running this

A few things to fix in a follow-up (NOT this PR):

  1. mur check --platform x64 — agent invoked this and got "unknown flag". Worked around with dotnet build -p:Platform=x64. Need to accept that flag.
  2. No persistence/LocalSettings scenario — agent searched mur find "UsePersisted" and got nothing. Real coverage gap.
  3. Benchmark's reactor-mur skill lives in a foreign repo — sync is currently manual. Probably wants an upstream PR to win-dev-skills-benchmark after this lands.

Evidence

Raw trial dirs (local):

  • D:\win-dev-skills-benchmark\agent-benchmark\results\run9\cw49_reactor-mur_s46_i1\ — counter-winui +7 A/B
  • D:\win-dev-skills-benchmark\agent-benchmark\results\run10\ — 4-scenario sweep
  • session-logs-dir\build-events.jsonl in each contains the actual mur find "<query>" calls
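
For anyone wanting to re-derive the call counts from those logs, a small sketch follows. The schema of build-events.jsonl is not shown in this thread, so the sample lines and the "command" key are invented, and each line is treated as opaque text rather than parsed JSON.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in log lines (hypothetical shape, not the real build-events.jsonl schema).
var sampleLines = new[]
{
    "{\"command\": \"mur find counter\"}",
    "{\"command\": \"mur get use-state-basic\"}",
    "{\"command\": \"dotnet build -p:Platform=x64\"}",
    "{\"command\": \"mur find increment button\"}",
};

var (findCount, getCount) = CountMurCalls(sampleLines);
Console.WriteLine($"mur find: {findCount}, mur get: {getCount}");
// prints: mur find: 2, mur get: 1

// Counts lines that mention each command; a line mentioning both is counted for both.
static (int Finds, int Gets) CountMurCalls(IEnumerable<string> lines)
{
    int finds = 0, gets = 0;
    foreach (var line in lines)
    {
        if (line.Contains("mur find", StringComparison.Ordinal)) finds++;
        if (line.Contains("mur get", StringComparison.Ordinal)) gets++;
    }
    return (finds, gets);
}
```

Against a real trial directory, the sample array could be swapped for File.ReadLines(path), which streams the file one line at a time.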

Happy to run the full 22-scenario × baseline-vs-treatment matrix if you want a stronger statistical claim, but it'd cost ~3 premium reqs/trial × 44 trials (roughly 132 requests in total). The A/B above plus the sweep table is the cleanest signal I can produce in a couple of hours.

@sundaramramaswamy
Collaborator Author

Controlled A/B — you were right to push back. Re-ran with a proper control.

Setup

Identical scenario (counter-winui), identical model (sonnet-4.6), identical skill (with the "find first" prompt). Only difference: which mur.dll is on disk.

  • Treatment: real catalogue (5 results for "counter")
  • Control: stub catalogue — same mur binary surface, but scenarios.json is {"scenarios":[]} so every mur find returns "No matches found for ..."

n=3 trials per arm, run sequentially (treatment first, then swap dll, then control).
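
To make the control arm concrete, here is a sketch of how an empty catalogue leads to a "No matches" reply on every query. Only the {"scenarios":[]} payload and the use-state-basic ID come from this PR; the id/title field names, the title text, the substring matching, and the exact message format are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

const string stub = "{\"scenarios\":[]}";
// Hypothetical shape of a real entry; the actual scenario schema is not shown here.
const string real = "{\"scenarios\":[{\"id\":\"use-state-basic\",\"title\":\"Counter with useState\"}]}";

Console.WriteLine(Find(stub, "counter")); // prints: No matches found for counter
Console.WriteLine(Find(real, "counter")); // prints: use-state-basic

// Simplified stand-in for the search: case-insensitive substring match on the title.
static string Find(string catalogueJson, string query)
{
    using var doc = JsonDocument.Parse(catalogueJson);
    var ids = doc.RootElement.GetProperty("scenarios")
        .EnumerateArray()
        .Where(s => (s.GetProperty("title").GetString() ?? "")
            .Contains(query, StringComparison.OrdinalIgnoreCase))
        .Select(s => s.GetProperty("id").GetString() ?? "")
        .ToList();

    return ids.Count == 0
        ? $"No matches found for {query}"
        : string.Join(", ", ids);
}
```

Because the stub keeps the same top-level shape, the binary and its command surface stay identical between arms; only the search results differ.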

Results

Arm                         i1  i2  i3  mean  median  range
Treatment (real catalogue)  64  71  73  69.3  71      64–73
Control (stub catalogue)    52  73  64  63.0  64      52–73
Δ (T − C)                               +6.3  +7
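
The summary columns can be reproduced from the six raw scores. A short sketch (the Median helper is written here for illustration and is not part of the benchmark tooling):

```csharp
using System;
using System.Globalization;
using System.Linq;

int[] treatment = { 64, 71, 73 };
int[] control = { 52, 73, 64 };
var inv = CultureInfo.InvariantCulture;

double meanDelta = treatment.Average() - control.Average();
double medianDelta = Median(treatment) - Median(control);

Console.WriteLine($"Treatment: mean {treatment.Average().ToString("F1", inv)}, median {Median(treatment)}");
Console.WriteLine($"Control:   mean {control.Average().ToString("F1", inv)}, median {Median(control)}");
Console.WriteLine($"Delta:     mean {meanDelta.ToString("+0.0;-0.0", inv)}, median {medianDelta.ToString("+0;-0", inv)}");
// prints:
// Treatment: mean 69.3, median 71
// Control:   mean 63.0, median 64
// Delta:     mean +6.3, median +7

// Median of a small sample: middle value, or the average of the two middle values.
static double Median(int[] xs)
{
    var sorted = xs.OrderBy(x => x).ToArray();
    int mid = sorted.Length / 2;
    return sorted.Length % 2 == 1
        ? sorted[mid]
        : (sorted[mid - 1] + sorted[mid]) / 2.0;
}
```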

All 6 trials built and ran successfully. Behavioral evidence:

  • Treatment agents made 1–8 mur find calls per trial and acted on the returned IDs.
  • Control agents made exactly 3 mur find calls each, all returning "No matches", then gave up and wrote from skill memory.

Honest take

Catalogue helps, but it's a smaller effect than my last post implied.

  • Mean +6.3, median +7: catalogue trials win on aggregate. The control's worst (52) is much weaker than the treatment's worst (64), suggesting catalogue provides a floor — when the agent flails, examples ground it.
  • Best=Best (73=73): when sonnet-4.6 has good instincts on a given roll, it builds a fine counter app from the skill alone. The catalogue isn't load-bearing for easy scenarios.
  • n=3 is suggestive, not significant: distributions overlap. A 6-point mean delta could be noise at this sample size. Would need n≥10 per arm to claim significance.
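
One way to quantify the overlap is Welch's t statistic (difference of means divided by the combined standard error). This is an added back-of-the-envelope check, not something the benchmark computes, and it assumes the scores are roughly normally distributed, which three points per arm cannot confirm.

```csharp
using System;
using System.Globalization;
using System.Linq;

double[] treatment = { 64, 71, 73 };
double[] control = { 52, 73, 64 };

double t = WelchT(treatment, control);
Console.WriteLine("Welch t = " + t.ToString("F2", CultureInfo.InvariantCulture));
// prints: Welch t = 0.95

// Unbiased sample variance (divides by n - 1).
static double SampleVariance(double[] xs)
{
    double mean = xs.Average();
    return xs.Sum(x => (x - mean) * (x - mean)) / (xs.Length - 1);
}

// Welch's t: mean difference over the square root of the summed per-arm variance / n.
static double WelchT(double[] a, double[] b) =>
    (a.Average() - b.Average())
    / Math.Sqrt(SampleVariance(a) / a.Length + SampleVariance(b) / b.Length);
```

A statistic below 1 means the mean difference is smaller than its own standard error, which is consistent with reading the +6.3 as suggestive only. The control arm's variance (111 versus about 22 for treatment) dominates that standard error.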

What we definitively know now:

  1. The "find first" prompt drives behavior change (confirmed — control still made 3 calls/trial).
  2. Returning real results instead of "No matches" pushes mean score up by ~6 points on counter-winui.
  3. Whether that effect generalizes to harder scenarios (kanban/markdown-editor) and survives more trials is still untested.

Raw data: D:\win-dev-skills-benchmark\agent-benchmark\results\run11\ (treatment) and \run12\ (control).

Recommendation: merge on the strength of (a) behavioral change being undeniable, (b) the floor-raising effect being directionally clear, and (c) the catalogue being mechanically sound. But don't oversell it as a dramatic improvement — it's a real but modest grounding effect. Bigger sample sizes and harder scenarios are followup work.


Development

Successfully merging this pull request may close these issues.

Implement spec 043: mur find / sample catalogue (Phase 1)

2 participants