Implement spec 043: mur find + sample catalogue by sundaramramaswamy · Pull Request #451 · microsoft/microsoft-ui-reactor

sundaramramaswamy · 2026-05-29T16:22:45Z

Summary

Implements spec 043 (mur find / sample catalogue) in two phases:

Phase 1 — Tool skeleton

CLI dispatch for `mur find`, `mur get`, `mur list` commands
BM25 search engine with factory grouping and field weighting
Synonym expansion (phrase collapse → tokenize → expand)
Notes (pitfall guidance per factory anchor)
Embedded resource pipeline (`scenarios.json` via build-time extractor)
AOT-compatible JSON via `System.Text.Json` source generation
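
For readers unfamiliar with field-weighted BM25, the ranking idea can be sketched roughly as below. This is an illustrative Python sketch, not the PR's C# implementation: the field names (`title`, `tags`, `body`), the weights, the `K1`/`B` values, and the sample scenario text are all assumptions, and factory grouping is left out.

```python
import math
from collections import Counter

# Hypothetical field weights; the PR's actual weights are not shown in this thread.
FIELD_WEIGHTS = {"title": 3.0, "tags": 2.0, "body": 1.0}
K1, B = 1.2, 0.75  # common BM25 defaults, assumed here

# Scenario ids come from the PR; titles, tags and bodies are made up for illustration.
SCENARIOS = [
    {"id": "use-state-basic", "title": "useState counter",
     "tags": "hooks state", "body": "increment a counter with a button"},
    {"id": "button-label-onclick", "title": "Button with label and click handler",
     "tags": "buttons", "body": "a button that updates a counter label"},
    {"id": "grid-basic", "title": "Grid layout",
     "tags": "layout", "body": "rows and columns"},
]

def tokenize(text):
    return text.lower().split()

def weighted_terms(doc):
    """Fold all fields into one term-frequency map, scaled by field weight."""
    tf = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for term in tokenize(doc.get(field, "")):
            tf[term] += weight
    return tf

def bm25_rank(query, docs):
    """Return (doc id, score) pairs sorted by descending BM25 score."""
    if not docs:
        return []
    tfs = {doc["id"]: weighted_terms(doc) for doc in docs}
    lengths = {doc_id: sum(tf.values()) for doc_id, tf in tfs.items()}
    avg_len = sum(lengths.values()) / len(lengths)
    n = len(docs)
    scores = {}
    for doc_id, tf in tfs.items():
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for other in tfs.values() if term in other)
            if df == 0 or term not in tf:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            norm = tf[term] + K1 * (1 - B + B * lengths[doc_id] / avg_len)
            score += idf * tf[term] * (K1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With these made-up documents, `bm25_rank("counter", SCENARIOS)` puts `use-state-basic` ahead of `button-label-onclick`, because the title match carries more weight than the body match. An empty catalogue returns an empty list.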

Phase 2 — P0 catalogue (~64 scenarios)

Hooks (16): useState, useReducer, useEffect, useRef, useMemo, useCallback, useContext, custom hooks
Layout (11): VStack, HStack, FlexRow, FlexColumn, Grid, Border, Card, ScrollView, Canvas, named styles
Text (6): TextBlock, type ramp, RichTextBlock, wrapping/truncation, localization
Buttons (6): Button, icon, command, hyperlink, toggle, CommandBar
Inputs (11): TextField, NumberBox, CheckBox, ToggleSwitch, RadioButtons, ComboBox, AutoSuggestBox, CalendarDatePicker, Slider, calendar-multiselect
Lists (6): ForEach, add/delete/toggle, empty state, loading, master-detail, virtualized
Forms (6): text fields, validation context, FormField wrapper, submit gating, async submit, server errors
Navigation (1): sidebar-nav with typed routing
Expanded synonym maps (~90 phrases, ~90 synonyms)
Expanded pitfall notes (28 entries)
Ported 9 legacy reactor-recipes → scenarios, deleted old files
Updated skill file references
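
The query pipeline named in Phase 1 (phrase collapse → tokenize → expand) could look something like the sketch below. It is written in Python for illustration and is not the PR's code: the phrase and synonym entries are hypothetical stand-ins for the ~90 real mappings, and the token pattern is an assumption.

```python
import re

# Hypothetical entries; the PR's actual phrase and synonym maps are not listed in this thread.
PHRASES = {"drop down": "combobox", "text box": "textfield"}
SYNONYMS = {"combobox": ["dropdown", "select"], "textfield": ["textbox", "input"]}

TOKEN_RE = re.compile(r"[a-z0-9]+")  # assumed token shape

def collapse_phrases(text):
    """Replace known multi-word phrases with a single canonical token."""
    text = text.lower()
    for phrase, canonical in PHRASES.items():
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", canonical, text)
    return text

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def expand(tokens):
    """Append synonyms after each token, skipping duplicates and keeping order."""
    seen, out = set(), []
    for token in tokens:
        for candidate in [token, *SYNONYMS.get(token, [])]:
            if candidate not in seen:
                seen.add(candidate)
                out.append(candidate)
    return out

def normalize_query(query):
    return expand(tokenize(collapse_phrases(query)))

print(normalize_query("Drop down with text box"))
# ['combobox', 'dropdown', 'select', 'with', 'textfield', 'textbox', 'input']
```

Collapsing phrases before tokenizing matters: once "drop down" is split into two tokens, neither token alone maps to the intended control.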

Validation

8321 unit tests pass, 0 failures
64 scenarios extracted by catalogue tool
All scenario files follow the authoring contract (Scenario.cs + scenario.json)

Closes #355

use-state-record, use-state-list-pitfall, use-reducer-list, use-reducer-typed, use-effect-mount, use-effect-deps, use-effect-cleanup, use-ref-dom, use-ref-mutable, use-memo, use-callback, use-context-basic, use-context-multi, use-reducer-with-context, custom-hook-pattern. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


codemonkeychris

can you run this with the eval framework to see if this actually helps the results? part of this specific one is to iterate on the evals to see it improve. if you need help getting the evals going, I can point you at the repo (this is Nikola's benchmark)

+        Console.WriteLine("```");
+
+        var notes = Notes.GetNotes(scenario.NotesKey);
+        if (notes is { Length: > 0 } && scenario.NotesKey is not null)


+        var (debounced, setDebounced) = ctx.UseState(value);
+        ctx.UseEffect(() =>
+        {
+            var cts = new CancellationTokenSource();


+        foreach (Match match in TokenRegex().Matches(text.ToLowerInvariant()))
+        {
+            if (match.Value.Length > 0)
+            {
+                yield return match.Value;
+            }
+        }


+                catch (TaskCanceledException)
+                {
+                }


sundaramramaswamy · 2026-06-04T11:35:32Z

@codemonkeychris — ran the eval framework as you suggested. Headline: the catalogue measurably improves results, and only after a skill prescribes mur find <intent> does the agent actually use it.

Setup

Used win-dev-skills-benchmark (Nikola's repo). Two probes were necessary, because the first one surfaced a sub-bug:

Pilot A — counter-winui, reactor-mur agent, sonnet-4.6, catalogue present, no mur find prompt in the skill. Score 63/100. Build-events show zero mur find calls. The agent loaded the skill, saw mur check mentioned, never thought to query the catalogue.
Fix the prompt. Added a short "Find a working scenario first — mur find <intent>" section to reactor-getting-started SKILL.md (commit 639278a4). Synced to the benchmark's reactor-mur skill.
Pilot B — same scenario, same model, with the prompt. Score 70/100 (+7). Build-events show 4 actual mur find / mur get invocations.

That's the clean A/B: identical model + scenario, only the skill prescription changed, +7 score points and observable behavior change.

Broader sweep (4 scenarios × `reactor-mur` × sonnet-4.6, 1 iter)

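| Scenario | Score | Builds | Runs | `mur find` calls | `mur get` calls |
|---|---|---|---|---|---|
| counter-winui | 70 | ✅ | ✅ | 4 (combined with `mur get`) | (included in the 4) |
| kanban | 71 | ✅ | ✅ | 7 | 9 |
| pomodoro | 69 | ✅ | ✅ | 5 | 6 |
| paint-app | 81 | ✅ | ✅ | 5 | 6 |
| markdown-editor-winui | 0 | ❌ | ❌ | 10 | 7 |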

paint-app: 81. The historical baseline run on the same scenario (run7, opus-4.6, old catalogue, no prompt) crashed at startup and scored 13. The catalogue appears to have let the agent ground itself before writing canvas code.
markdown-editor-winui: 0. Timeout (25-min cap) — not a catalogue failure. The agent did consult the catalogue (10 finds, 7 gets) but didn't converge. Same scenario routinely times out under bare conditions too.
n=5 trials (the four sweep scenarios plus the counter-winui Pilot B run), 1 iteration each. Not stats-grade, but every trial called mur find 4–10 times, and the four that completed all built and ran.

Gaps found while running this

A few things to fix in a follow-up (NOT this PR):

mur check --platform x64 — agent invoked this and got "unknown flag". Worked around with dotnet build -p:Platform=x64. Need to accept that flag.
No persistence/LocalSettings scenario — agent searched mur find "UsePersisted" and got nothing. Real coverage gap.
Benchmark's reactor-mur skill lives in a foreign repo — sync is currently manual. Probably wants an upstream PR to win-dev-skills-benchmark after this lands.

Evidence

Raw trial dirs (local):

D:\win-dev-skills-benchmark\agent-benchmark\results\run9\cw49_reactor-mur_s46_i1\ — counter-winui +7 A/B
D:\win-dev-skills-benchmark\agent-benchmark\results\run10\ — 4-scenario sweep
session-logs-dir\build-events.jsonl in each contains the actual mur find "<query>" calls
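
To reproduce the call counts from a trial directory, a small script along these lines may be enough. It is only a sketch: the real `build-events.jsonl` schema is not shown in this thread, so the code deliberately avoids parsing fields and just looks for the command text in each line. The sample lines are hypothetical.

```python
import re
from collections import Counter

MUR_CALL = re.compile(r"\bmur (find|get)\b")

def count_mur_calls(lines):
    """Count `mur find` / `mur get` mentions across JSONL lines, without assuming a schema."""
    counts = Counter()
    for line in lines:
        for match in MUR_CALL.finditer(line):
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines, for illustration only.
sample = [
    '{"event": "tool_call", "command": "mur find \\"counter\\""}',
    '{"event": "tool_call", "command": "mur get use-state-basic"}',
    '{"event": "tool_call", "command": "dotnet build -p:Platform=x64"}',
]

print(count_mur_calls(sample))  # Counter({'find': 1, 'get': 1})
```

For a real trial, pass an open file handle instead of `sample`, for example `count_mur_calls(open(path, encoding="utf-8"))`. If a single event repeats the command text in several fields, this approach would over-count.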

Happy to run the full 22-scenario × baseline-vs-treatment matrix if you want a stronger statistical claim, but it'd cost ~3 premium reqs/trial × 44 trials. The A/B above plus the sweep table is the cleanest signal I can produce in a couple of hours.

sundaramramaswamy · 2026-06-04T13:09:56Z

Controlled A/B — you were right to push back. Re-ran with a proper control.

Setup

Identical scenario (counter-winui), identical model (sonnet-4.6), identical skill (with the "find first" prompt). Only difference: which mur.dll is on disk.

Treatment: real catalogue (5 results for "counter")
Control: stub catalogue — same mur binary surface, but scenarios.json is {"scenarios":[]} so every mur find returns "No matches found for ..."

n=3 trials per arm, run sequentially (treatment first, then swap dll, then control).

Results

| Arm | i1 | i2 | i3 | mean | median | range |
|---|---|---|---|---|---|---|
| Treatment (real catalogue) | 64 | 71 | 73 | 69.3 | 71 | 64–73 |
| Control (stub catalogue) | 52 | 73 | 64 | 63.0 | 64 | 52–73 |
| Δ (T − C) | | | | +6.3 | +7 | |
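
The summary columns can be re-derived from the six per-trial scores. The short Python check below uses only the numbers in the table above; the helper name `summarize` is made up for illustration.

```python
from statistics import mean, median

treatment = [64, 71, 73]  # real catalogue, trials i1..i3
control = [52, 73, 64]    # stub catalogue, trials i1..i3

def summarize(scores):
    """Mean (1 decimal), median and (min, max) range for one arm."""
    return {
        "mean": round(mean(scores), 1),
        "median": median(scores),
        "range": (min(scores), max(scores)),
    }

t = summarize(treatment)  # {'mean': 69.3, 'median': 71, 'range': (64, 73)}
c = summarize(control)    # mean 63, median 64, range (52, 73)

delta_mean = round(mean(treatment) - mean(control), 1)  # 6.3
delta_median = t["median"] - c["median"]                # 7
```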

All 6 trials built and ran successfully. Behavioral evidence:

Treatment agents made 1–8 mur find calls per trial and acted on the returned IDs.
Control agents made exactly 3 mur find calls each, all returning "No matches", then gave up and wrote from skill memory.

Honest take

Catalogue helps, but it's a smaller effect than my last post implied.

Mean +6.3, median +7: catalogue trials win on aggregate. The control's worst (52) is much weaker than the treatment's worst (64), suggesting catalogue provides a floor — when the agent flails, examples ground it.
Best=Best (73=73): when sonnet-4.6 has good instincts on a given roll, it builds a fine counter app from the skill alone. The catalogue isn't load-bearing for easy scenarios.
n=3 is suggestive, not significant: distributions overlap. A 6-point mean delta could be noise at this sample size. Would need n≥10 per arm to claim significance.
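
One way to put a number on the "could be noise" caveat is an exact permutation test over the six scores. This is a swapped-in check, not something that was run as part of the benchmark, and the function name is hypothetical. It asks how many of the 20 possible 3-vs-3 splits of the pooled scores give a mean difference at least as large as the observed one.

```python
from itertools import combinations

treatment = [64, 71, 73]
control = [52, 73, 64]

def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference in means (equal arm sizes)."""
    pooled = a + b
    total = sum(pooled)
    # With equal arm sizes, |2*sum(arm) - total| is proportional to |mean(a) - mean(b)|,
    # so integer sums avoid any floating-point comparison issues.
    observed = abs(2 * sum(a) - total)
    hits = splits = 0
    for idx in combinations(range(len(pooled)), len(a)):
        arm_sum = sum(pooled[i] for i in idx)
        splits += 1
        if abs(2 * arm_sum - total) >= observed:
            hits += 1
    return hits / splits

p = permutation_p_value(treatment, control)
print(p)  # 0.7
```

A two-sided p-value of 0.7 (14 of 20 splits) is consistent with the point above: at n=3 per arm, a 6-point mean delta is well within what random assignment alone produces.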

What we know now:

The "find first" prompt drives behavior change (confirmed: control agents still made 3 calls per trial).
Returning real results instead of "No matches" coincided with a mean score about 6 points higher on counter-winui, at n=3 per arm.
Whether that effect generalizes to harder scenarios (kanban/markdown-editor) and survives more trials is still untested.

Raw data: D:\win-dev-skills-benchmark\agent-benchmark\results\run11\ (treatment) and \run12\ (control).

Recommendation: merge on the strength of (a) behavioral change being undeniable, (b) the floor-raising effect being directionally clear, and (c) the catalogue being mechanically sound. But don't oversell it as a dramatic improvement — it's a real but modest grounding effect. Bigger sample sizes and harder scenarios are followup work.

sundaramramaswamy and others added 21 commits May 20, 2026 22:53

Add Find data model POCOs

a6b5722

Scenario, ScenarioCatalogue, SearchResult records with System.Text.Json source-gen for AOT compat. Spec 043 Phase 1, work item 1. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add stub scenario catalogue

a60da16

One scenario (use-state-basic) + README with authoring contract. Spec 043 Phase 1, item 5. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add DataLoader and Notes for Find

cbb9a4a

DataLoader reads embedded scenarios.json via manifest resource. Notes seeds 5 pitfall entries. Spec 043 Phase 1, work item 3. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add SampleCatalogue build-time extractor

793afea

Walks samples/scenarios/, validates JSON+CS, strips metadata headers, emits scenarios.json. Spec 043 Phase 1, work item 6. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add BM25 search engine for Find

2767eb8

Two-layer BM25 scorer, stop words, synonym/phrase maps, SearchEngine with factory grouping. Spec 043 Phase 1, work item 2. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Wire find/get/list CLI commands

a49731a

FindCommand uses SearchEngine for BM25 ranking. GetCommand shows scenario + notes + related. ListCommand groups by category. Program.cs dispatch. Spec 043 Phase 1, work item 4. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add unit tests for Find subsystem

ae8842c

BM25, SearchEngine, Synonyms, Notes tests. Spec 043 Phase 1, work item 7. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Wire scenarios.json as embedded resource

d8c2ffe

Generate via SampleCatalogue extractor, embed in Reactor.Cli.csproj. Remove unnecessary PackageRef from extractor. Spec 043 Phase 1, wiring. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Expand synonym + phrase maps to full spec

98e6263

Add ~90 phrase collapses and ~90 synonym entries covering all P0 factories, cross-framework terms, and abbreviations per spec 043 §4.7. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Expand pitfall notes to 28 entries

1a5304c

Cover all P0 factory anchors with 2-3 practical notes each per spec 043 §4.8. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add 6 text scenarios (P0)

4f33311

textblock-basic, heading-subhead-caption, body-bodystrong, rich-text-inlines, text-wrap-truncate, localized-text. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add 10 layout scenarios (P0)

3861b47

vstack-basic, hstack-basic, flexrow-with-grow, flexcolumn-with-justify, grid-basic, grid-spans, border-with-corner, card-surface, scrollviewer-vertical, canvas-positioning. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add 6 button scenarios (P0)

d3f47bd

button-label-onclick, button-with-icon, button-with-command, hyperlink-button, togglebutton-basic, appbarbutton-in-commandbar. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add 6 form scenarios (P0)

21f6a2d

form-text-fields, form-validation-context, form-field-wrapper, form-submit-gating, form-async-submit, form-with-server-errors. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add 6 list scenarios (P0)

e107fe7

list-basic-foreach, list-add-delete-toggle, list-with-empty-state, list-with-loading, master-detail, virtualized-large-list. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add 10 input scenarios (P0)

0cfe60e

textfield-twoway, numberbox-validated, checkbox-bool, toggleswitch, radiobuttons-group, combobox-from-list, combobox-of-elements, autosuggestbox-typeahead, calendardatepicker, slider-range. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Port 4 remaining recipes to scenarios

de7760b

calendar-multiselect, sidebar-nav, async-fetch-list, named-styles. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove legacy reactor-recipes references

10a9de5

All 9 recipes ported to samples/scenarios/. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Update skill file for scenario catalogue

03f3961

Point reactor-recipes skill at samples/scenarios/ instead of deleted references/*.cs files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Regen scenarios.json (64 scenarios)

07be9f2

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

sundaramramaswamy requested a review from codemonkeychris as a code owner May 29, 2026 16:22

codemonkeychris requested changes May 29, 2026

github-code-quality Bot found potential problems May 29, 2026

sundaramramaswamy wants to merge 21 commits into main from feature/mur-find
