Derive portable artifact roots from versionControlProvenance at emit-finalize by michaelcfanning · Pull Request #2973 · microsoft/sarif-sdk

michaelcfanning · 2026-06-05T14:45:39Z

Summary

Replaces the caller-supplied --srcroot rewrite at emit-finalize with a VCP-driven rebasing visitor. At finalize, EmitFinalizeRebaseVisitor deconstructs absolute local file paths into relative URIs plus portable, per-repository uriBaseIds derived from run.versionControlProvenance. The shipped SARIF carries no machine-specific path.

Portable-root derivation supports github.com and dev.azure.com in this release:

github.com → clickable commit permalink https://github.com/<owner>/<repo>/blob/<revisionId>/.
dev.azure.com → commit-less repository root https://dev.azure.com/<org>/<project>/_git/<repo>/. ADO per-file web URLs are query-based (?path=...&version=GC<sha>) and cannot serve as a SARIF uriBaseId prefix (RFC 3986 relative resolution drops the query), so commit pinning rides on versionControlProvenance.revisionId, which finalize preserves and GHAzDO ingestion reads.
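
The uriBaseId constraint above can be checked with any RFC 3986 resolver. The following is a minimal Python sketch, not SDK code; the owner, org, project, repo, and sha values are placeholders, and `resolve_artifact` is a hypothetical helper name. It shows why a path-based base works as a prefix while a query-based ADO per-file URL does not.

```python
from urllib.parse import urljoin


def resolve_artifact(base_uri: str, relative_uri: str) -> str:
    """Resolve a relative artifact URI against a uriBaseId value (RFC 3986, section 5)."""
    return urljoin(base_uri, relative_uri)


# Placeholder bases; none of these identify a real repository.
github_base = "https://github.com/owner/repo/blob/0123abc/"
ado_query_base = "https://dev.azure.com/org/project/_git/repo?path=/&version=GC0123abc"
ado_root_base = "https://dev.azure.com/org/project/_git/repo/"

# Path-based permalink: the commit stays in the resolved URI.
print(resolve_artifact(github_base, "src/a.cs"))

# Query-based base: resolution discards the base query, so the commit pin
# (and, without a trailing slash, the last path segment) is lost.
print(resolve_artifact(ado_query_base, "src/a.cs"))

# Commit-less repository root: resolves cleanly; the commit has to travel
# separately, which is what versionControlProvenance.revisionId is for.
print(resolve_artifact(ado_root_base, "src/a.cs"))
```

The second call is the failure mode the description refers to: the `version=GC<sha>` parameter never reaches the resolved URI, which is why the ADO base is the commit-less repository root.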

Follow-up to #2971 (get-schema + emit-run rename + receipt-time SRCROOT-exists check). Fast-follow scope tracked in #2972.

Contract (hard-enforced at finalize; FAILURE before any file is written)

Each run declares ≥1 versionControlProvenance entry whose mappedTo carries only a uriBaseId, binding a declared originalUriBaseIds root.
repositoryUri must be an absolute http(s) URI at the host's default port; revisionId non-empty.
A repositoryUri carrying credentials (user:password@), a query, or a fragment fails closed, and all diagnostics are sanitized so a secret can never surface in an error message.
Host derivation: github.com (<owner>/<repo>) and dev.azure.com (<org>/<project>/_git/<repo>); a .git suffix is stripped and empty / dotted / multi-segment repo names are rejected through a single shared guard. Other hosts fail with a clear, planned-follow-up message.
Base-id minting: one repo → bare SRCROOT; multiple repos → SRCROOT_<REPO-LEAF>, with a _2, _3, … suffix on collision. Mixed github + ADO repos in one run each get a distinct base. Each artifact location is attributed to its owning repo by longest-prefix match on the mapped local root.
An absolute local path that no declared root resolves fails closed (warn-and-ship would leak the machine path) — confirmed with @michaelcfanning.
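
The contract above can be illustrated with a short Python sketch. This is not the EmitFinalizeRebaseVisitor implementation: the function names are hypothetical, the upper-casing of the repo leaf is an assumption, and the local roots are assumed to be stored with a trailing slash. It covers three of the rules: fail-closed repositoryUri checks with messages that never echo the input, base-id minting, and longest-prefix attribution.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def check_repository_uri(repository_uri: str) -> str:
    """Return the URI if it meets the contract; otherwise raise a sanitized error."""
    try:
        parts = urlsplit(repository_uri)
        port = parts.port
    except ValueError:
        raise ValueError("repositoryUri is not a well-formed URI") from None
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        raise ValueError("repositoryUri must be an absolute http(s) URI")
    if parts.username is not None or parts.password is not None:
        raise ValueError("repositoryUri must not carry credentials")
    if parts.query or parts.fragment:
        raise ValueError("repositoryUri must not carry a query or fragment")
    if port is not None and port != DEFAULT_PORTS[parts.scheme]:
        raise ValueError("repositoryUri must use the host's default port")
    return repository_uri


def mint_base_ids(repo_leaves: list[str]) -> list[str]:
    """One repo gets bare SRCROOT; several get SRCROOT_<LEAF> with _2, _3 on collision."""
    if len(repo_leaves) == 1:
        return ["SRCROOT"]
    seen: dict[str, int] = {}
    minted = []
    for leaf in repo_leaves:
        base = "SRCROOT_" + leaf.upper()  # assumption: leaf is upper-cased
        seen[base] = seen.get(base, 0) + 1
        minted.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return minted


def owning_base(local_path: str, roots: dict[str, str]) -> str:
    """Pick the base id whose mapped local root is the longest prefix of the path."""
    matches = [(len(root), base_id) for base_id, root in roots.items()
               if local_path.startswith(root)]
    if not matches:
        # Fail closed, and do not echo the machine-specific path.
        raise ValueError("an absolute local path is not under any declared root")
    return max(matches)[1]
```

For example, with roots `{"SRCROOT_APP": "/src/app/", "SRCROOT_LIB": "/src/app/vendor/lib/"}`, a file under `/src/app/vendor/lib/` is attributed to SRCROOT_LIB rather than the enclosing SRCROOT_APP.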

emit-run stamping

TryStampVcp sets versionControlProvenance[0].mappedTo.uriBaseId = "SRCROOT" when (a) the run declares originalUriBaseIds["SRCROOT"] and (b) the entry has no caller-supplied mappedTo. Multi-entry provenance is left untouched; a caller-authored mappedTo is never overridden. emit-run stays host-agnostic.
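
The stamping rule can be sketched in Python over a plain dictionary that stands in for the run. This is an illustration only: `try_stamp_vcp` is a hypothetical name, and the boolean return value is an assumption about how a caller might learn whether stamping happened.

```python
def try_stamp_vcp(run: dict) -> bool:
    """Bind the single provenance entry to SRCROOT when the caller left it unbound."""
    base_ids = run.get("originalUriBaseIds") or {}
    entries = run.get("versionControlProvenance") or []

    if "SRCROOT" not in base_ids:
        return False  # (a) fails: the run declares no SRCROOT root
    if len(entries) != 1:
        return False  # multi-entry (or absent) provenance is left untouched
    if "mappedTo" in entries[0]:
        return False  # (b) fails: a caller-authored mappedTo is never overridden

    entries[0]["mappedTo"] = {"uriBaseId": "SRCROOT"}
    return True


run = {
    "originalUriBaseIds": {"SRCROOT": {"uri": "file:///work/repo/"}},
    "versionControlProvenance": [
        {"repositoryUri": "https://github.com/owner/repo", "revisionId": "0123abc"}
    ],
}
stamped = try_stamp_vcp(run)
```

Nothing in the sketch inspects the host of repositoryUri, which mirrors the statement that emit-run stays host-agnostic.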

Generator reconstructs provenance (no hand-patching)

CweGenerateSample.ps1 previously resolved repositoryUri live from the origin remote but hardcoded revisionId/branch to a frozen github sha. Copying the taxonomy into another repo (e.g. Azure DevOps) and running it produced an incoherent versionControlProvenance — a repositoryUri for repo X paired with a commit that exists only in github sarif-sdk — that had to be hand-patched before publishing.

The script now resolves the whole triple atomically, never per-field:

Default → full live reconstruction from the git working tree (origin remote, HEAD, abbrev branch), with actionable failures for a missing remote, an unborn HEAD, or a detached HEAD.
-Deterministic → the canonical fixture pin (the v4.5.0 commit 84f83c81…), so the checked-in fixtures regenerate byte-identically on any machine, commit, or fork. The byte-gate test passes this.
-RevisionId / -Branch layer onto live mode (explicit anchoring / detached HEAD). The -GHAzDO variant stamps BUILD_REPOSITORY_URI / BUILD_SOURCEVERSION / BUILD_SOURCEBRANCH from the same resolved values so AdoPipelineContext's field-by-field agreement check passes.
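
The mode selection above can be sketched in Python (the script itself is PowerShell; this is only an illustration). `Provenance` and `resolve_provenance` are hypothetical names, the live and pinned values are passed in rather than read from git, and rejecting overrides under -Deterministic is an assumption drawn from the statement that the overrides layer onto live mode.

```python
from typing import NamedTuple, Optional


class Provenance(NamedTuple):
    repository_uri: str
    revision_id: str
    branch: str


def resolve_provenance(live: Provenance,
                       pin: Provenance,
                       deterministic: bool = False,
                       revision_id: Optional[str] = None,
                       branch: Optional[str] = None) -> Provenance:
    """Resolve the whole triple from one source; live and pinned fields never mix."""
    if deterministic:
        if revision_id is not None or branch is not None:
            # Assumption: overrides are only meaningful in live mode.
            raise ValueError("-RevisionId / -Branch layer onto live mode only")
        return pin
    # Live mode: explicit overrides replace the values read from the working tree.
    return live._replace(
        revision_id=revision_id if revision_id is not None else live.revision_id,
        branch=branch if branch is not None else live.branch,
    )


# Placeholder values; the real pin is the fixture commit named above.
live = Provenance("https://dev.azure.com/org/project/_git/repo", "live-sha", "refs/heads/main")
pin = Provenance("https://github.com/microsoft/sarif-sdk", "pinned-sha", "pinned-branch")
```

Because each branch of the function returns a triple built from a single source, a repositoryUri from one repository can no longer be paired with a commit that exists only in another.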

A new live-coherence [Fact] proves default mode stamps this repo's real HEAD (not the pin) and restores the fixture afterward, so the working tree stays clean.

Genchi genbutsu (end-to-end proof)

Copied the Taxonomies directory into a live Azure DevOps repository and regenerated with no patch:

SRCROOT anchored at the ADO repository root; revisionId auto-resolved to the ADO HEAD; branch refs/heads/main.
Validation 0 errors / 0 warnings / 0 notes for both variants; 0 machine-path leaks.
GHAzDO ingestion of the -GHAzDO variant returned HTTP 200.

Validation

Full Test.UnitTests.Sarif.Multitool.Library suite: 319 passed, 1 skipped (incl. 8 new ADO / credential / mixed-host visitor [Fact]s).
Cwe byte-identical fixture gate + live-coherence test: green.
ReleaseHistory style test: green.
Release build with EnforceCodeStyleInBuild (IDE0005-as-error, the CI gate): clean.

Deferred to #2972

Receipt-time VCP/mappedTo validation at emit-run; non-cloud ADO shapes (<org>.visualstudio.com, collection/virtual-directory) and GHE; submodule / multi-repo collision-suffix determinism; SSH→HTTPS normalization; test-design smell review.

…finalize

Replace the caller-supplied `--srcroot` rewrite with an EmitFinalizeRebaseVisitor that deconstructs absolute local file paths into relative URIs plus portable, per-repository uriBaseIds anchored at GitHub blob permalinks derived from run.versionControlProvenance. The shipped SARIF carries no machine-specific path.

emit-finalize now hard-requires each run to declare at least one versionControlProvenance entry whose mappedTo.uriBaseId (and only a uriBaseId) binds a declared originalUriBaseIds root with an absolute http(s) repositoryUri and a non-empty revisionId. A single repository collapses to the bare SRCROOT base; multiple repositories each receive SRCROOT_<REPO-LEAF>, disambiguated by an ordinal suffix on collision. Portable derivation is github.com-only in this release (default port; owner/repo path); other hosts fail with a clear, planned follow-up message. A local path that no declared root resolves fails finalize rather than shipping a machine-specific path. Rebasing runs after enrichment reads sources from the local file:// bases and before serialization; on failure the verb writes each error to stderr and returns FAILURE before any file is written.

emit-run's auto-stamping now binds the single source-repo entry to the source root: it sets versionControlProvenance[0].mappedTo.uriBaseId = "SRCROOT" when the run declares originalUriBaseIds["SRCROOT"] and the caller supplied no mappedTo, staying hands-off for multi-entry provenance and never overriding a caller's binding. emit-run remains host-agnostic; only finalize is github-only.

Rewrite CweGenerateSample.ps1 to drop --srcroot and anchor the fixture's revisionId at a real, immutable sarif-sdk commit (the v4.5.0 release tag) so the derived permalink resolves to a real blob; regenerate both Cwe sample fixtures.

Fast-follow tracked in #2972: receipt-time validation, non-github hosts, multi-repo/submodule fixtures, and a test-design smell review.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ruct sample provenance

emit-finalize now derives a portable per-repository root for dev.azure.com repositories alongside github.com. The github host keeps its clickable /blob/<revisionId>/ permalink; the Azure DevOps host resolves to the commit-less dev.azure.com/<org>/<project>/_git/<repo>/ repository root, because ADO per-file web URLs are query-based (?path=...&version=GC<sha>) and cannot serve as a SARIF uriBaseId prefix (RFC 3986 relative resolution drops the query). Commit pinning therefore rides on versionControlProvenance.revisionId, which finalize preserves and GHAzDO ingestion reads.

The host dispatch is hardened: a repositoryUri carrying credentials, a query, or a fragment fails closed, and diagnostics are sanitized so a secret can never surface in an error message. Repository-name segments are normalized through a single guard that strips a .git suffix and rejects empty / dotted / multi-segment names. github and ADO roots can be minted in the same run, each getting a distinct SRCROOT_<REPO> base.

CweGenerateSample.ps1 previously resolved repositoryUri live from the origin remote but hardcoded revisionId/branch to a frozen github sha. Copying the taxonomy into another repository (e.g. Azure DevOps) and running it produced an incoherent versionControlProvenance (a repositoryUri for repo X paired with a commit that exists only in github sarif-sdk) that had to be hand-patched before it could be published.

The script now resolves the whole triple atomically: the live git working tree by default (full reconstruction wherever the taxonomy is run), or the canonical fixture pin under -Deterministic. The two are never mixed. -RevisionId / -Branch layer onto live mode for detached-HEAD or explicit anchoring; the -GHAzDO variant stamps BUILD_REPOSITORY_URI / BUILD_SOURCEVERSION / BUILD_SOURCEBRANCH from the same resolved values so AdoPipelineContext's field-by-field agreement check passes.

The byte-gate test passes -Deterministic so the checked-in fixtures regenerate identically; a new live-coherence test proves default mode stamps the real HEAD and restores the fixture afterward. Verified end to end by copying the taxonomy into a live Azure DevOps repository and regenerating with no patch: SRCROOT anchored at the ADO repository root, revisionId auto-resolved to the ADO HEAD, validation 0/0/0, and GHAzDO ingestion returned HTTP 200.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

A CI / pipeline checkout (GitHub Actions, Azure DevOps) lands on a detached HEAD, where `git rev-parse --abbrev-ref HEAD` returns `HEAD` and names no branch. Default-mode provenance reconstruction therefore threw, which broke the live-coherence test on the hosted runner and, more importantly, would break any consumer that copied this taxonomy into their own pipeline and ran it unattended: the exact "must hand-patch to get a valid SARIF" failure the reconstruction work set out to remove.

When git cannot name the branch, fall back to the ref the CI system publishes for the same checkout: BUILD_SOURCEVERSION's companion BUILD_SOURCEBRANCH on Azure DevOps, GITHUB_REF on GitHub Actions. That ref points at the very commit HEAD is parked on, so it is still coherent live provenance, not the canonical-pin mixing the atomic resolver forbids. The vars are read before Set-AdoEnv scrubs them for the multitool subprocess, so the parent value is intact. Only when no CI ref exists either does the script fail, directing the caller to -Branch or -Deterministic.

Verified by detaching HEAD locally with GITHUB_REF set: default mode resolves revisionId to the live HEAD and branch to the CI ref, validation 0/0/0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
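
The detached-HEAD fallback in the commit above can be sketched in Python (the script is PowerShell; this is an illustration, not its code). `resolve_branch` is a hypothetical name, the environment is passed in as a mapping, and checking BUILD_SOURCEBRANCH before GITHUB_REF is an assumption, since the commit message does not state an order. Any normalization of the branch name is not modeled.

```python
from typing import Mapping

# CI-published refs named in the commit message; the order is an assumption.
CI_REF_VARIABLES = ("BUILD_SOURCEBRANCH", "GITHUB_REF")


def resolve_branch(abbrev_ref: str, env: Mapping[str, str]) -> str:
    """Name the branch for live provenance, falling back to the CI ref when detached.

    abbrev_ref is the output of `git rev-parse --abbrev-ref HEAD`.
    """
    if abbrev_ref and abbrev_ref != "HEAD":
        return abbrev_ref  # git can name the branch; nothing else is consulted

    for name in CI_REF_VARIABLES:
        value = env.get(name, "").strip()
        if value:
            return value  # the ref the CI system published for this checkout

    raise RuntimeError(
        "HEAD is detached and no CI ref is set; pass -Branch or -Deterministic"
    )
```

For instance, `resolve_branch("HEAD", {"GITHUB_REF": "refs/heads/main"})` returns the CI ref, while an empty environment on a detached HEAD raises with the same guidance the commit describes.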

* emit-init-run: auto-stamp ADO pipeline automationDetails from env + GHAzDO sample (#2929) * emit-init-run: auto-stamp ADO pipeline automationDetails from env + GHAzDO sample Adds AdoPipelineContext, which detects an Azure DevOps pipeline execution context from the standard predefined environment variables and stamps run.automationDetails so producers that run inside ADO pipelines automatically satisfy GHAzDO1019 and GHAzDO1020 with no additional CLI flags. - TryDetect is three-state (None / Partial / Complete). Partial fails loudly with a per-variable diagnostic before any file-system side effects so a misconfigured pipeline never emits a half-stamped SARIF. - ApplyTo writes the canonical azuredevops/pipeline/build/<org>/<projectId>/<buildDefId>/<phaseId>/<branchRef>/<buildId> id and the four azuredevops/pipeline/build/* property keys ADO Advanced Security ingestion validates. - Composes with the existing --automation-guid / --automation-correlation-guid flags; never overwrites a producer-supplied guid/correlationGuid. CweGenerateSample.ps1 grows a -GHAzDO switch that produces the new CweGHAzDoSample.sarif fixture alongside the existing CweSample.sarif. The script populates the ADO env vars for the duration of emit-init-run so AdoPipelineContext stamps automationDetails, then patches tool.driver.fullName post-finalize so GHAzDO1018 passes. Default-mode runs explicitly clear those same env vars so a developer shell with TF_BUILD=True can never drift the AI-shape fixture. CweGHAzDoSample.sarif validates with zero errors, zero warnings, and zero notes under --rule-kind Sarif;AI;GHAzDO. CweGeneratedSampleTests covers both fixtures with byte-identical regression gates as separate [Fact]s sharing one private helper. 
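
The three-state detection described for AdoPipelineContext can be sketched in Python. This is not the SDK's C# code: `try_detect` and `Detection` are hypothetical names, and REQUIRED_VARIABLES is an illustrative subset chosen for the example, not the list the verb actually checks.

```python
from enum import Enum
from typing import List, Mapping, Tuple


class Detection(Enum):
    NONE = "None"          # not running in an ADO pipeline
    PARTIAL = "Partial"    # some variables set, some missing: fail loudly
    COMPLETE = "Complete"  # safe to stamp run.automationDetails


# Assumption: an illustrative subset, not the verb's real required set.
REQUIRED_VARIABLES = ("BUILD_BUILDID", "BUILD_DEFINITIONID", "BUILD_SOURCEBRANCH")


def try_detect(env: Mapping[str, str]) -> Tuple[Detection, List[str]]:
    """Classify the environment and return one diagnostic per missing variable."""
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if len(missing) == len(REQUIRED_VARIABLES):
        return Detection.NONE, []
    if missing:
        return Detection.PARTIAL, [f"{name} is not set" for name in missing]
    return Detection.COMPLETE, []
```

A caller would run this before touching the file system, so a Partial result can abort with its diagnostics and never leave a half-stamped SARIF behind.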
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Trim ReleaseHistory bullets + add copilot-instructions.md The two bullets I just added for env-driven ADO stamping and the GHAzDO sample fixture were PR-description-sized, not release-note-sized. Trimmed both to match the style of their neighbors (single self-contained sentence + concrete names + minimal facts a downstream consumer needs). The full narrative — three-state detection prose, env-var precedence table, composition guarantees — already lives on PR #2929 where it belongs. Adds .github/copilot-instructions.md so future agents in this repo see the release-notes-vs-PR-description distinction up front, plus the house idioms that come up repeatedly in code review (no [Theory], GHAzDO casing, AI ruleId convention, sample-fixture convention, side-effects-after-detection, internals-via-InternalsVisibleTo). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Port SARIF AI generation guidance from ai-plugins to sarif-sdk (#2930) Make sarif-sdk the single source of truth for the SARIF spec markdown, the AI-generated-findings profile, and the agent skills that emit and validate AI SARIF. Adds: - docs/spec/sarif-v2.1.0-spec.md Convenience markdown rendering of the OASIS SARIF 2.1.0 specification (Plus Errata 01). The OASIS-published document is canonical; IPR notice preserved at top of file. - docs/ai/generating-sarif.md Normative guidance for representing AI/LLM-produced security findings as first-class SARIF: ai/origin declaration, tool identity, result structure, exploitability and attacker-position vocabulary, evidence model, redaction, notification taxonomy (AI/EXEC/*, AI/CFG/*), and the full AI rule-pack appendix. Includes a Mermaid object-model diagram in the appendix. - docs/ai/example.sarif Comprehensive reference SARIF log that conforms to the AI profile. 
Passes `dotnet sarif validate --rule-kind 'Sarif;AI'` cleanly. - skills/emit-sarif-findings/SKILL.md Agent-operating procedure for emitting AI SARIF using the Sarif.Multitool emit verbs (emit-init-run, add-result, add-notification, emit-finalize --validate). Multitool-only; cross-references docs/ai/generating-sarif.md as the normative source. - skills/validate-sarif-findings/SKILL.md Agent-operating procedure for validating AI SARIF. Uses `--rule-kind 'Sarif;AI'` against the multitool's AI rule pack (AI1003-AI2019) plus the standard SARIF rules in one pass. Updates: - README.md adds a short pointer section to the new spec, guidance, and skills directories. - docs/multitool-usage.md gains a 'Modes' table entry for each of the new emit verbs (emit-init-run, add-result, add-notification, emit-finalize) plus a worked example. Verification gates run before commit: - `dotnet sarif validate docs/ai/example.sarif --rule-kind 'Sarif;AI'` reports 0 errors. - End-to-end smoke test (init -> add-result -> finalize --validate) produces a SARIF file with 1 result, 1 rule (CWE-78 enriched from the embedded MITRE CWE taxonomy). - All skill command snippets match actual --help output for the relevant verb at Sarif.Multitool 5.0.0. Companion work (separate PR in microsoft/ai-plugins): - Delete plugins/sarif/ entirely; the canonical home is now this repository. - Retool Swallowtail (and other AI-detector plugins in ai-plugins) to invoke Sarif.Multitool emit verbs directly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add PublishSampleToGhazdo.ps1 + clone-aware CweGenerateSample.ps1 (#2931) `CweGenerateSample.ps1` now derives `--vcp-repositoryuri` and the `emit-finalize --srcroot` prefix from `git -C $repoRoot remote get-url origin`, falling back to `https://github.com/microsoft/sarif-sdk` when origin is unset. 
On the canonical microsoft/sarif-sdk clone the generated fixtures (CweSample.sarif, CweGHAzDoSample.sarif) are byte-identical to the previous hardcoded form. GitHub origins get a `<repo>/blob/main/` SRCROOT prefix; other hosts (including ADO) get the bare repo URL with a trailing slash. Adds `src/Sarif/Taxonomies/PublishSampleToGhazdo.ps1` -- POSTs a gzipped SARIF to the GHAzDO SARIFs ingestion endpoint (`/{org}/{project}/_apis/alert/repositories/{repo}/sarifs?api-version=7.2-preview.1` on advsec.dev.azure.com, fallback dev.azure.com). Target org/project/repo are parsed from runs[0].versionControlProvenance[0].repositoryUri; PAT is read from the ADO_PAT environment variable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Scrub Microsoft-internal references from AI guidance port (#2932) Public-OSS hygiene pass on the SARIF AI guidance and skills. Descriptor ids that are already shipped in the SDK (AI/EXEC/ALAS-SIGNAL in AI2018.ProvideExecutionSignalArtifact and AI1014's AI/EXEC/* and AI/CFG/* prefixes) are kept as-is so the docs match the current SDK implementation. Changes: - Drop ALAS expansion and neutralize the signal-payload schema (descriptor id kept; no payload schema was ever enforced by the SDK). - Replace ProjectApi with FastAPI (five sites) in API-handler examples. - Replace 'Geneva cluster' with 'telemetry cluster' in a deployment example. - Replace example rule id SWT-CPP-001 with ACME-CPP-001. - Replace author: mikefan with sarif-sdk-maintainers in both skill frontmatters. - Soften a reference to an unpublished companion remediation guidance document. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add multitool add-reporting-descriptor verb Appends a fully-formed SARIF reportingDescriptor JSON object (supplied via --input <path> or stdin) to the staged event log produced by emit-init-run. Two targets: * Default → run.tool.driver.notifications[]. 
AI producers routinely emit notification descriptors (progress, telemetry, config errors). No id convention is enforced; notifications use opaque ids. * --rules → run.tool.driver.rules[]. Gated against AIRuleIdConvention.IsNovel so only NOVEL- novel-finding descriptors are accepted. Taxonomy-mapped rule descriptors (e.g., CWE-89) come from the taxonomy enricher at finalize time, not from this verb. Each descriptor id may appear at most once per event log. The verb scans the existing event log on receipt and rejects duplicates against either a prior add-reporting-descriptor event of the same target OR a descriptor pre-populated on the run-header. A --force escape hatch is acknowledged in error text but intentionally out of v1 scope. Event-log plumbing: * Adds SarifEventKinds.RuleDescriptor ("rule-descriptor") and SarifEventKinds.NotificationDescriptor ("notification-descriptor"), threaded through SarifEventLogReader's kind allow-list. * SarifEventReplayer buffers descriptor events and merges them into the target list BEFORE RegisterDescriptorsFromResults runs. This ordering matters: auto-registration synthesizes minimal descriptors only for ruleIds that aren't already represented, so an explicit NOVEL- descriptor pre-empts the minimal one. Header pre-populated descriptors are preserved by reference; the verb's emit-time dedup blocks id collisions between header and events. * New event kinds are additive within CurrentSchemaVersion = 1; older readers will skip unknown kinds harmlessly, matching the forward-compat shape used when Notification / Invocation kinds were added. 
Tests: * 16 [Fact] tests on AddReportingDescriptorCommand covering both happy paths (notifications default, --rules), id validation (missing/empty/non-string), the NOVEL gate (taxonomy id rejection on --rules path only), rich payload round-trip (messageStrings, defaultConfiguration, helpUri, properties, including a date-shaped property string to guard against Json.NET DateTime coercion), duplicate detection within and across targets, duplicate detection against header-pre-populated descriptors for both target arrays, missing-wip-file path, and two malformed-input cases (bad JSON, non-object root). * 3 [Fact] tests on SarifEventReplayer covering: rule-descriptor events populating rules and pre-empting auto-registration, notification-descriptor events populating notifications, and the header-pre-populated + events merge semantics. No [Theory]/[InlineData]; repeated scenarios use shared private helpers (SeedRunHeader) per house style. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Strip editorial prefixes from AI notification taxonomy (#2934) Notification descriptor ids now name the concern only: `DECISION`, `RULED-OUT`, `DATA-ACCESS-DENIED`, `ALAS-SIGNAL`, `TOOL-UNAVAILABLE`, etc. The previous `AI/EXEC/*` and `AI/CFG/*` prefixes repeated context the surrounding SARIF already carries: the array (`toolExecutionNotifications` vs `toolConfigurationNotifications`) encodes the kind, and `tool.driver.name` encodes the emitter. The same id MAY now legally appear in both arrays. Suffixing `EXEC` or `CFG` on every id is like suffixing `Class` on every C# class: the surrounding context already says what kind of thing it is. Placement is selected at authoring time: `add-notification` defaults to `toolExecutionNotifications`; `add-notification --config` (`-c`) routes to `toolConfigurationNotifications`. 
The event-log kind `SarifEventKinds.Notification` splits into `ExecutionNotification` (`"execution-notification"`) and `ConfigurationNotification` (`"configuration-notification"`); the replayer routes each to the matching invocation array. `AI1014.ExecutionNotificationPlacement` is deleted. Its sole purpose was enforcing prefix-vs-array consistency, which is structurally meaningless under the new convention (the array IS the kind). `AI2018` retains its semantic; the literal id it checks changes from `AI/EXEC/ALAS-SIGNAL` to `ALAS-SIGNAL`. BRK by the letter of v4.6.3 (AI1014 was added there), but AI rules adoption is low and v5.x is the right place for refinement over back-compat. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Generalize ALAS-SIGNAL notification id to LEARNING-SIGNAL (#2935) ALAS named a specific consumer (an internal learning system). Under the convention shipped in #2934, notification ids name the concern, not the consumer. LEARNING-SIGNAL describes what the signal is, independent of who reads it. While here, rename the AI2018 rule from ProvideExecutionSignalArtifact to ProvideLearningSignalArtifact for consistency: the class checks the LEARNING-SIGNAL id, the "Execution" qualifier was redundant under the new convention (placement is encoded by the array, not the id), and downstream learning systems aren't necessarily reading only the execution-side array. Affects: AI2018 rule class + file + RuleId const + 3 resource keys/messages, the AI2018 row in docs and skills tables, and the UNRELEASED BRK bullet for #2934 (whose own "ALAS-SIGNAL example" becomes "LEARNING-SIGNAL", and which now documents the class-and-id rename together). BRK on the just-merged BRK (both still UNRELEASED) — favored over shipping the consumer name. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Stamp ReleaseHistory UNRELEASED section as v5.0.0 (#2937) build.props was already bumped to <VersionPrefix>5.0.0</VersionPrefix> in #2924 (the SHA-1 BRK). This finishes the v5.0.0 cut by replacing the UNRELEASED placeholder header in ReleaseHistory.md with the canonical version banner (Sdk / Driver / Converters / Multitool / Multitool Library nuget links), matching the v4.6.4 format. Picked up by #2936 (the dev to main promotion PR). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Trim and split the over-descriptive v5.0.0 notification-taxonomy BRK (#2938) Per release-notes house style, bullets are one or two self-contained sentences; PR-description prose belongs in the PR. The original bullet was ~3x the length of its neighbors and re-litigated the motivation. Split into two tighter bullets: 1. Convention change + routing mechanism (id-prefix strip, new --config switch, event-kind split). 2. Rule-table changes (AI1014 removal, AI2018 rename). Drops the "prefixes were redundant because..." explanation, the wire value parentheticals (`"execution-notification"` etc.), and the "ALAS named a specific consumer" parenthetical. The change itself is visible in the renames; the reader doesn't need the rationale. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix CweGenerateSample.ps1 -GHAzDO crash inside real ADO pipelines (#2940) The mseng microsoft.sarif-sdk pipeline broke on the first build of main after the v5.x promotion (build 31555367). Symptom: CweGenerateSample.ps1 (args: -Configuration Release -GHAzDO) exited with code 1. ADO pipeline context is partially configured. Either populate every required variable or clear them all. 
Problems: BUILD_DEFINITIONID='1234' disagrees with SYSTEM_DEFINITIONID='9978' (both name the same pipeline identifier and must match) Root cause: the deterministic-fixture env override in CweGenerateSample.ps1 stamps BUILD_DEFINITIONID=1234 for byte-stable output, but does not also override SYSTEM_DEFINITIONID. ADO agents inject both. The verb's must-match cross-check in AdoPipelineContext.TryDetect (correctly) refuses to proceed when the two disagree. Fix the script (not the verb): add SYSTEM_DEFINITIONID alongside BUILD_DEFINITIONID in the \ ordered hashtable, plus SYSTEM_JOBID / SYSTEM_JOBNAME alongside SYSTEM_PHASEID / SYSTEM_PHASENAME for symmetric hygiene (those pairs are exempt from must-match but the default-mode \ cleanup loop iterates \ and benefits from covering the agent's full fallback set). The fixture SARIF bytes do not change — the primary env vars were already set and are the ones the verb actually reads. Regression gate: new CweGHAzDoSample_RegenerationSucceeds_WhenAmbientAdoFallbackEnvVarsConflict [Fact] explicitly seeds SYSTEM_DEFINITIONID / SYSTEM_JOBID / SYSTEM_JOBNAME with values that disagree with the script's deterministic primaries before invoking the script. Without the script fix it fails the same way the mseng build did; with the fix it passes byte-identical. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Refresh v5.0.0 release-history layout + add prefix legend (#2939) Three changes, all in ReleaseHistory.md: 1. Add a prefix legend at the top of the file. Codifies the six prefixes (DEP / BRK / BUG / NEW / PRF / FUN) and the 'BRK leads each section' rule. Footnote notes that older sections may predate the convention. 2. Reorder the v5.0.0 section so all BRK bullets lead (BRK -> NEW -> BUG). Pure line shuffling; relative order preserved within each group. 3. Normalize the lone 'BUGFIX:' bullet in v4.6.4 to 'BUG:' (matches the legend's canonical form). 
The deep-history 'BUGFIX, BRK:' entry in the v1.x section is left alone — that's immutable shipped state. No code or schema changes; ReleaseHistory.md only. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Replace emit-init-run flags with SARIF Run JSON contract A consumer agent reports the existing 14 typed CLI flags can't express multiple versionControlProvenance entries (one of which carries a properties bag documenting skills in play). Modeling every field as a flag explodes the surface; the peer emit verbs (add-result, add-notification, add-reporting-descriptor) already accept fully-formed SARIF JSON via --input or stdin for exactly this reason, with a documented rationale that applies more strongly to the run header than to a single result. Replace the v5.0.0 flag surface on emit-init-run with the same input/stdin payload contract. EmitInitRunOptions shrinks to three properties (OutputFilePath, InputFilePath, ForceOverwrite). The SarifEventReplayer's documented partial-Run shape (tool, language, columnKind, defaultEncoding, defaultSourceLanguage, originalUriBaseIds, versionControlProvenance, automationDetails, baselineGuid, redactionTokens, ...) is now reachable end-to-end through the verb. Receipt-time validators (no filesystem side-effects on rejection): required non-empty-string tool.driver.name; https-only tool.driver.informationUri and versionControlProvenance[].repositoryUri; https-or-file originalUriBaseIds["SRCROOT"].uri; canonical 8-4-4-4-12 automationDetails.guid/correlationGuid; exact-match ai/origin in {generated, annotated, synthesized}; SARIF-log-document rejection; parent-shape JSON-object enforcement at every nested accessor so a JValue indexer never throws into the broad catch. 
ADO stamping is now JToken-direct so producer-supplied SARIF fields outside the SDK typed Run model survive the wip-line append; the existing typed-Run materialization at emit-finalize is the documented boundary at which non-typed fields are dropped, consistent with every other SDK round-trip. AdoPipelineContext.ApplyTo(Run) becomes bool TryApplyTo(Run, out string error). It stamps automationDetails.id and the four azuredevops/pipeline/build/* properties only when absent and fails-with-diagnostic on per-field conflict. The previous unconditional-overwrite contract was inert in v5.0.0 (the flag surface couldn't supply those fields) but became a footgun once JSON input could. CweGenerateSample.ps1 rewrites its emit-init-run call to construct a PowerShell hashtable -> ConvertTo-Json -Depth 32 -Compress -> stdin pipe. Both CweSample.sarif and CweGHAzDoSample.sarif regenerate byte-identically (verified by CweGeneratedSampleTests, which gates the fixtures sha-256). skills/emit-sarif-findings/SKILL.md Step 1 is rewritten to show the JSON construction; the inputs table picks up the multi-VCP and properties-bag annotations; the package constraint bumps to Sarif.Multitool >= 5.1.0. docs/multitool-usage.md's flag example is replaced with the stdin form. ReleaseHistory.md gets a new v5.1.0 UNRELEASED section with three bullets: BRK on the flag-surface removal, BRK on AdoPipelineContext.ApplyTo, NEW on the JSON-payload contract. Verification: - dotnet build src/Sarif.Sdk.sln: 0 warnings, 0 errors. - Test.UnitTests.Sarif.Multitool.Library: 217 passed, 1 skipped. - Test.UnitTests.Sarif: 896 passed, 3 skipped. - Test.UnitTests.Sarif.Driver: 140 passed, 1 skipped. - CweGeneratedSampleTests (3): pass; both fixtures byte-identical. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Stamp ReleaseHistory v5.0.1 section with nuget links src/build.props was bumped to <VersionPrefix>5.0.1</VersionPrefix> in be6fb706 (the emit-init-run JSON-contract change). 
This finishes the v5.0.1 cut by replacing the UNRELEASED placeholder header in ReleaseHistory.md with the canonical version banner (Sdk / Driver / Converters / Multitool / Multitool Library nuget links), matching the v5.0.0 format. Folded into #2942 so main is shippable the moment the promotion lands. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Trim v5.0.1 release-notes bullets to neighbor density The three bullets were 4-6 sentences each, embedding validator catalogs and finalize-time round-trip prose that belong in the PR description, not in ReleaseHistory.md. Repo style explicitly calibrates against the neighbors and asks for trim/split when a bullet exceeds ~3x — the BRK and NEW bullets here now sit at roughly the same density as the v5.0.0 rename bullets above them. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add multitool add-invocation verb Mirrors add-result / add-notification / add-reporting-descriptor: takes a fully-formed SARIF Invocation JSON object via --input <path> or stdin and appends it to the staged event log as a SarifEventKinds.Invocation event. SarifEventReplayer strips run.invocations[] carried on the run header, so this verb is the only path producers have to populate the array. The verb imposes no schema beyond must be a JSON object (SARIF makes every field on Invocation optional); full-log shape validation lives in emit-finalize --validate. AddInvocationOptions / AddInvocationCommand follow the established pattern. Program.cs registers and dispatches the new verb. SKILL.md, docs/multitool-usage.md, and ReleaseHistory.md updated. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Drop unused System.Text using in EmitInitRunCommandTests CI's BuildAndTest.ps1 invokes dotnet build with --no-incremental and /p:EnforceCodeStyleInBuild=true, which surfaces IDE0005 (unused using) as an error. 
Local Debug + default incremental builds skipped the check and let the unused System.Text directive ride into be6fb706.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix NOVEL ruleId example and clarify enricher ownership of region.snippet

Two corrections to docs/ai/generating-sarif.md flagged by an external AI-authoring feedback session that retargeted against Sarif.Multitool@5.0.1:

1. The Novel-findings subsection used the slash form ('NOVEL/<sub-id>' on result.ruleId, bare 'NOVEL' on the descriptor) - which AddResultCommand and AddReportingDescriptorCommand reject at receipt. The canonical form (per docs/AI-RuleId-Convention.md and AIRuleIdConvention.s_novelPrefix) is the dash-flat 'NOVEL-<sub-id>'; descriptor.id and result.ruleId are byte-identical. The obsolete 'ruleIndex required for NOVEL' paragraph is removed - each NOVEL- now has a unique id, so the SARIF 3.19.23 non-unique-id workaround no longer applies.

2. The Code Context subsection told AI tools to populate region.snippet and contextRegion.snippet on every finding. emit-finalize already runs InsertOptionalDataVisitor with RegionSnippets | ContextRegionSnippets | ComprehensiveRegionProperties | Hashes, reading the file from disk and filling these fields itself. The producer SHOULD emit region.startLine and region.endLine; the enricher owns everything else. Pre-populating wastes tokens and drift-risks the consumer's view of the file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Restore RuleKind.Ado as [Obsolete] alias for RuleKind.GHAzDO (#2945)

#2928 renamed RuleKind.Ado to RuleKind.GHAzDO without leaving a back-compat alias. Restore Ado as an [Obsolete] alias resolving to the same underlying value (4), so pre-rename source still compiles and '--rule-kind ado' continues to bind on the multitool CLI via the existing case-insensitive enum parser. The obsolete-warning steers new callers off the deprecated spelling without breaking them.
Two new [Fact] tests pin the alias contract (same value; case-insensitive parse of 'ado' resolves to GHAzDO). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Trigger Validate workflow on dev-targeted pushes and PRs PRs targeting dev were silently skipping the build-and-test / check-format / build-multitool-for-npm jobs because the workflow filter was scoped to main only. Adding dev to both the push and pull_request branch lists so CI gates dev work the same way it gates main work. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Consolidate v5.0.1 release-notes for actual ship v5.0.1 was tagged but never published to NuGet, so the bullet that lived under an unreleased v5.0.2 header folds back into v5.0.1 — that's the version that will actually ship from this PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Make autocrlf-sensitive tests deterministic across line-ending configurations (#2859) On Windows agents with `core.autocrlf=input` (LF on disk, CRLF Environment.NewLine), several tests compared values normalized to `Environment.NewLine` against C# verbatim string literals whose embedded newlines were LF on disk. They passed on default Windows (`autocrlf=true`: CRLF on disk = CRLF NewLine) and on Linux/Mac (`autocrlf=input`: LF on disk = LF NewLine) but failed on the cross-grained combo that no CI configuration exercises. Two principled moves: 1. Boundary normalization in `TestAssetResourceExtractor.GetResourceText` — canonicalize the read text to `Environment.NewLine` (`\r\n` -> `\n` -> `Environment.NewLine`). This is the single point where text resources enter the test harness, so every consumer (`FileDiffingUnitTests`, `InsertOptionalDataVisitorTests`, and ad-hoc callers) inherits the normalization for free. 2. 
Rewrite the affected literal-string assertions to express newlines explicitly with `Environment.NewLine` rather than rely on the source file's on-disk line endings: `string.Join(Environment.NewLine, ...)` for multi-line bodies, `$@"...{Environment.NewLine}..."` for short ones. Touches `StackTests`, `WebRequestTests`, `WebResponseTests`, `AndroidStudioConverterTests`, `FortifyUtilitiesTests`. `InsertOptionalDataVisitor.txt` is embedded as a resource and hashed by a `[Trait(TestTraits.WindowsOnly, "true")]` test where the on-disk hash must remain stable, so `.gitattributes` pins that file to `eol=crlf`. Co-authored-by: Michael C. Fanning <mikefan@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Drop AI1015 ProvideRunDefaultSourceLanguage from AI authoring guidance (#2948) * Drop AI1015 ProvideRunDefaultSourceLanguage from AI authoring guidance The partition-by-language model proved insufficient as a baseline-updating strategy (the earlier hypothesis that we could replace AI results for one language while retaining results for another on receipt of a new log file). Remove the MUST that an AI run set 'run.defaultSourceLanguage' and partition by '(repository, branch, language)' tuple, along with the 'Run partitioning by language' section in 'docs/ai/generating-sarif.md' and the AI1015 row in the rule table. 'defaultSourceLanguage' remains an accepted optional SARIF Run field for viewer rendering; we simply no longer mandate it. No code references existed for AI1015, so this is a pure doc walkback. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Reclassify AI1015 drop as BUG and trim bullet Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix SARIF1012 NRE when result.ruleId does not resolve (#2944) (#2949) `SARIF1012.MessageArgumentsMustBeConsistentWithRule` previously threw a `NullReferenceException` when `result.message.id` was set but `result.ruleId` did not resolve to a rule in `tool.driver.rules[]` (no match by id, ruleIndex, or hierarchical base). The null-guard at lines 52-53 inspected `currentRules` (the collection) rather than the resolved `rule` instance, and used a short-circuiting null-conditional check that silently fell through to the diagnostic-emit branch's indexer access. Replace the guard with an explicit three-prong check: rule == null || rule.MessageStrings == null || !rule.MessageStrings.ContainsKey(result.Message.Id) All three null cases now emit the existing `MessageIdMustExist` diagnostic with the unresolved rule id (or `null`). Regression fixture: adds a 4th result to `SARIF1012.MessageArgumentsMustBeConsistentWithRule_Invalid.sarif` that references a non-existent `NoSuchRule` with `message.id: AnyId`, plus the corresponding expected diagnostic at `startLine: 53`. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Replace Generate-CweTaxonomy.ps1 with cross-platform Python regenerator (#2921) (#2950) The PowerShell-only 'scripts/Generate-CweTaxonomy.ps1' assumed a Windows / pwsh environment to regenerate the CWE taxonomy assets, which is friction for non-Windows contributors and the original complaint behind #2921. Replace it with a single 'scripts/generate_cwe_taxonomy.py' (Python 3, stdlib only) that runs identically on Linux, macOS, and Windows. The script: * Downloads 'cwec_latest.xml.zip' via 'urllib.request' into a temporary staging directory. 
* Extracts via 'zipfile' and locates the embedded 'cwec_v*.xml'. * Parses every weakness (handling the CWE XML namespace), buckets by Status, sorts by numeric ID, derives the SARIF2012-conformant Pascal-case identifier (preferring the parenthesized common name when present), resolves the View-1000 ChildOf Primary parent, and emits the four-section help markdown (Description / Extended Description / Common Consequences / Potential Mitigations). * Writes 'CweTaxonomy.sarif' and 'CweTaxonomy.brief.md' in place under 'src/Sarif/Taxonomies/' with UTF-8 (no BOM) and LF line endings so the embedded resources hash identically regardless of host OS. Default invocation requires no arguments and downloads from MITRE: python3 scripts/generate_cwe_taxonomy.py For offline / testing scenarios, '--xml path/to/cwec_v*.xml' bypasses the download; '--source-url' overrides the MITRE URL; '--output-dir' overrides the artifact destination. '.gitattributes' retains the LF pinning on the two generated artifacts. 'src/Sarif/Taxonomies/CweReadme.md' Regeneration section is rewritten to show the single Python invocation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Narrow SARIF1001 case-fold relaxation to AI-origin notification descriptors (#2951) (#2955) Initial revision of this fix universally relaxed SARIF1001's id/name comparison to 'StringComparison.Ordinal' on the grounds that SARIF v2.1.0 § 3.49.7 only forbids strict-identical pairs. Per maintainer review, that read overshoots: the case-fold comparison is an authorial SHOULD layered above the spec MUST, and SHOULD layers are precisely what validators add value with. Removing the SHOULD globally weakens the typo-catcher for hand-authored descriptors. The narrower carve-out: AI notification taxonomies (issue #2952) deliberately pair a SCREAMING-CAPS opaque id with the corresponding PascalCase end-user name (e.g. 'DECISION' / 'Decision'). 
That convention is machine-coordinated and specific to AI emitters; it does not apply to hand-authored rule and taxon descriptors.

The cut is the intersection of two existing context signals:

1. 'IsAIOriginRun()' -- the run carries 'properties["ai/origin"]', the same gate used by SARIF2002, SARIF2009 (the literal peer rule for identifier conventions), SARIF2014, and SARIF2015.
2. 'Context.CurrentReportingDescriptorKind == Notification' -- the descriptor was reached via 'tool.driver.notifications[]', not 'rules[]' or 'taxa[]'.

AI rule ids are constrained by AI1012 to 'BASE/sub-id' or 'NOVEL-<sub-id>' forms whose hyphens / slashes cannot case-fold-collide with any PascalCase name, so extending the carve-out to rules would be a no-op for AI and a regression for hand-authored taxonomies. Strict-identical comparisons remain unconditional everywhere (spec MUST).

The Invalid functional fixture now spans three runs covering each boundary of the carve-out:

    run[0]  non-AI     rules          RULE0001/RULE0001  (spec MUST)
                                      RULE0002/RULE0002  (spec MUST)
    run[1]  AI-origin  notifications  STRICT/STRICT      (spec MUST under AI)
                       rules          DECISION/Decision  (rules-not-exempt)
    run[2]  non-AI     notifications  DECISION/Decision  (non-AI-not-exempt)

The Valid functional fixture pairs a non-AI tool with hand-authored rules and an AI-origin tool whose notifications carry 'DECISION/Decision' and 'RULE-COVERAGE-GAP/RuleCoverageGap' -- the latter pair documents the taxonomy convention even though hyphens already keep it case-fold-distinct.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add GHAzDO1021 (ProvideShortBranchNameInVcp) (#2954) (#2958)

Add a GHAzDO validation rule for run.versionControlProvenance[].branch values that start with refs/<class>/, using ^refs/[^/]+/ to detect full-ref branch names and recommending the stripped short form. The AdvSec Service silently drops VCP entries whose branch is not a short branch name, so the rule is scoped only to versionControlProvenance[].branch.
It deliberately does not flag the full-ref branch segment embedded in run.automationDetails.id, which remains part of the existing GHAzDO1020 contract. Add valid and invalid ValidateCommand fixtures covering short branch names and refs/heads, refs/tags, and refs/pull full refs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Enrich versionControlProvenance from CI pipeline env (ADO + GitHub Actions) (#2957) (#2959) EmitInitRunCommand currently stamps automationDetails.id and the four `azuredevops/pipeline/build/*` property keys when `TF_BUILD=True` is detected, but it does not lift anything into `versionControlProvenance` - even though `BUILD_SOURCEBRANCH` is already in hand. AdvSec ingestion relies on VCP for branch + revision attribution, and this gap means the data lands only when a producer hand-constructs the entry into the run-header JSON. The same gap exists for AI scanners running under GitHub Actions. Extend AdoPipelineContext to read the two optional argument vars `BUILD_REPOSITORY_URI` and `BUILD_SOURCEVERSION`, and derive a short branch name from the existing `BUILD_SOURCEBRANCH` by stripping any leading `refs/<class>/` segment (so `refs/heads/main` becomes `main`; `refs/pull/42/merge` becomes `42/merge`; `refs/tags/v1` becomes `v1`). The two new env vars are optional - absence does not degrade Complete -> Partial - but malformed presence does (URI must be absolute http(s); revision must match `^[0-9a-fA-F]{7,40}$`). Add a parallel `GitHubActionsContext` for `GITHUB_ACTIONS=true` that mirrors the same shape but is VCP-scoped (no pipeline-identity contract today): `GITHUB_SERVER_URL` + `GITHUB_REPOSITORY` compose the repository URI, `GITHUB_SHA` supplies the revision (same hex regex), and `GITHUB_REF_NAME` is preferred over `GITHUB_REF` (stripping the same `refs/<class>/` prefix). Custom GHES servers compose correctly via trailing-slash normalization. 
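The env-derived field handling described in this commit message can be sketched in a few lines. This is a minimal Python illustration, not the SDK's C# code: the two regular expressions are quoted from the text above, while the function names (`short_branch`, `is_valid_revision`, `compose_repository_uri`, `is_absolute_http_uri`) are hypothetical and exist only for this example.

```python
import re
from urllib.parse import urlparse

# Patterns quoted from the commit message; everything else is illustrative.
_REF_PREFIX = re.compile(r"^refs/[^/]+/")
_REVISION = re.compile(r"^[0-9a-fA-F]{7,40}$")


def short_branch(ref: str) -> str:
    """Strip one leading refs/<class>/ segment, if present."""
    return _REF_PREFIX.sub("", ref, count=1)


def is_valid_revision(revision: str) -> bool:
    """Accept 7 to 40 hex characters, per the quoted revision pattern."""
    return _REVISION.match(revision) is not None


def compose_repository_uri(server_url: str, repository: str) -> str:
    """Join a server URL and an owner/repo pair with exactly one slash.

    Assumption: this is what "trailing-slash normalization" amounts to.
    """
    return server_url.rstrip("/") + "/" + repository.lstrip("/")


def is_absolute_http_uri(uri: str) -> bool:
    """True when the URI has an http(s) scheme and a host."""
    parsed = urlparse(uri)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

With these helpers, `short_branch("refs/pull/42/merge")` yields `42/merge`, matching the example in the text, and a server URL with or without a trailing slash composes to the same repository URI.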
EmitInitRunCommand grows a `TryStampVcp` JObject-direct stamper that mirrors the existing `TryStampAdoContext` shape and operates on three input shapes: 1. `versionControlProvenance` absent or empty array -> synthesize a new entry only when `repositoryUri` was detected (anchor field). Branch/revision without a repository URI is informationally thin and the synthesized entry would not bind to a repo for ingestion. 2. `versionControlProvenance` contains exactly one entry -> enrich missing fields; fail with a per-field conflict diagnostic when any supplied field disagrees with the detected pipeline value. Repository URI equality treats scheme/host case-insensitively (RFC 3986) via `Uri.TryCreate` round-trip; branch and revision are byte-wise. 3. `versionControlProvenance` contains multiple entries -> leave untouched. The caller has declared a multi-repo shape and we refuse to guess which entry names the pipeline's source repo. A `TryResolveVcpFields` orchestrator layers the two sources before stamping: ADO is the higher-priority source per the documented "env-takes-priority" rule, GHA fills any gap where ADO is silent, and fields populated on both sources MUST agree or the verb aborts with a diagnostic naming both sources. The stamper itself is source-agnostic - it takes the resolved (repositoryUri, revisionId, branch) triple and does not know which env produced each field. Probe-before-write semantics on the single-entry path leave the JObject unchanged when a conflict is detected, matching the existing `TryStampAdoContext` contract - a half-stamped VCP is worse than a clean refusal. Why no disk-git fallback? The verb deliberately does not shell out to `git.exe` to recover this data from the working tree. 
The two CI envs cover both surfaces an AI scanner lands in, `add-result`'s producer-supplied JSON is the universal escape hatch for everything else, and adding a soft runtime dependency on `git.exe` (with its own failure modes around shallow clones, detached HEADs, and non-existent branches) would make the verb's output depend on disk state. Producers running locally outside CI either set the env vars themselves or populate VCP directly in the input JSON. CI fixture isolation `CweGenerateSample.ps1`'s `\` map gains the two new optional ADO vars (`BUILD_REPOSITORY_URI` set to the resolved git remote URL; `BUILD_SOURCEVERSION` set to the same zero-SHA placeholder the hardcoded VCP entry carries) so the -GHAzDO variant stamps a fully populated ADO env shape and AdoPipelineContext detects no conflict against the supplied VCP. The same map also adds `\` entries for `GITHUB_ACTIONS` / `GITHUB_SERVER_URL` / `GITHUB_REPOSITORY` / `GITHUB_SHA` / `GITHUB_REF_NAME` / `GITHUB_REF` so that ambient GitHub Actions env on the macOS CI runner (which sets a real `GITHUB_SHA`) cannot trip GitHubActionsContext into reporting a revisionId that conflicts with the zero-SHA placeholder. Without this scrubbing, both `CweSample_Sarif_IsByteIdenticalToCweGenerateSampleScriptOutput` and `CweGHAzDoSample_Sarif_IsByteIdenticalToCweGenerateSampleScriptOutput` break under macos-latest. The same property is now gated by `CweGHAzDoSample_RegenerationSucceeds_WhenAmbientGitHubActionsEnvVarsConflict`, the GHA-side parallel of the existing ambient-ADO regression test. Closes #2957. Tests: 7 new EmitInitRunCommandTests covering GHA-only stamping, ADO+GHA agreement, the gap-fill path, cross-source disagreement on each field, GHA partial-env refusal, and producer-supplied conflicts under GHA. 11 new GitHubActionsContextTests covering detection states, malformed inputs, REF_NAME vs REF precedence, and URI normalization. 1 new CweGeneratedSampleTests fact gating ambient-GHA-env isolation. 
The original 11 ADO VCP EmitInitRunCommandTests and 9 AdoPipelineContextTests still green. 263/264 in Test.UnitTests.Sarif.Multitool.Library and 4/4 CweGeneratedSampleTests pass locally. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Stamp ReleaseHistory v5.0.2 section with nuget links and bump VersionPrefix (#2960) Promotes the v5.0.2 work to ship-ready: - `src/build.props`: `VersionPrefix` 5.0.1 -> 5.0.2, `PreviousVersionPrefix` 5.0.0 -> 5.0.1. - `ReleaseHistory.md`: stamp the `v5.0.2` header with nuget links for Sdk / Driver / Converters / Multitool / Multitool Library (UNRELEASED -> shipped). - `skills/emit-sarif-findings/SKILL.md`: bump the recommended Sarif.Multitool minimum from 5.0.1 to 5.0.2. v5.0.2 is the first release where `emit-init-run` enriches versionControlProvenance from the CI pipeline environment (Azure DevOps + GitHub Actions), which the skill's required commit-sha / branch / repo-uri inputs are stamped from automatically. Six v5.0.2 bullets ship: * BRK: scripts/Generate-CweTaxonomy.ps1 -> scripts/generate_cwe_taxonomy.py (#2921 / #2950). * NEW: `emit-init-run` enriches `versionControlProvenance` from ADO + GitHub Actions env (#2957 / #2959). * NEW: GHAzDO1021 `ProvideShortBranchNameInVcp` (#2954 / #2958). * BUG: Drop AI1015 `ProvideRunDefaultSourceLanguage` (#2948). * BUG: SARIF1012 NRE on unresolved ruleId (#2944 / #2949). * BUG: SARIF1001 case-fold relaxation for AI notification descriptors (#2951 / #2955). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Revert GHAzDO1021 + short-form branch normalization (false premise) (#2962) The GHAzDO1021 rule (``ProvideShortBranchNameInVcp``) and the related short-form branch-name normalization in the ADO and GHA env enrichers were built on a false premise. 
The original observation -- that the GHAzDO/AdvSec ingestion service silently dropped ``run.versionControlProvenance[]`` entries whose ``branch`` started with ``refs/<class>/`` -- turned out to be job-processing latency misread as validation failure. A product engineer from the AdvSec team has since confirmed that ingestion accepts both short (``main``) and long (``refs/heads/main``) shapes, so there is nothing to warn about and nothing to normalize. This change reverts the lot before v5.0.2 ships: * Delete ``GHAzDO1021.ProvideShortBranchNameInVcp`` rule plus its four Valid/Invalid Inputs and ExpectedOutputs fixtures, the resx + Designer entries, and the ``RuleId`` constant. * ``AdoPipelineContext``: drop ``s_branchRefPrefixRegex``, ``BranchShortName``, and ``NormalizeBranchRef``. ``TryDetect`` passes ``BUILD_SOURCEBRANCH`` through verbatim; ``BranchRef`` is the sole branch property, used directly when stamping VCP. * ``GitHubActionsContext``: drop the ``GITHUB_REF_NAME`` fallback entirely (the runner always sets both env vars, so this is invisible in production but keeps the property honestly long-form). Rename ``BranchShortName`` -> ``BranchRef``; new ``TryReadOptionalBranchRef`` is a pass-through. * ``EmitInitRunCommand``: rename ``vcpBranchShortName`` -> ``vcpBranch``; ``TryResolveVcpFields`` / ``TryStampVcp`` parameter renames; doc comments updated. * Tests: ``ValidateCommandTests`` drops the two GHAzDO1021_* methods; ``AdoPipelineContextTests`` renames the four "BranchShortName_Strips*" tests to "Passes*Through" and asserts only ``BranchRef`` (long form); ``GitHubActionsContextTests`` switches to ``GITHUB_REF`` setup, drops the two RefName-preference tests in favour of three pass-through tests; ``EmitInitRunCommandTests`` updates four env setups and six assertions to the long form. 
* ``CweGenerateSample.ps1`` supplies ``branch = 'refs/heads/main'`` so the GHAzDO-variant sample agrees with the env-derived long form on cross-source check; ``CweSample.sarif`` and ``CweGHAzDoSample.sarif`` regenerated. * ``ReleaseHistory.md`` v5.0.2 UNRELEASED: drop the GHAzDO1021 NEW bullet; soften the VCP-enrichment bullet wording to reflect pass-through semantics (no short-form derivation). No version bump: v5.0.2 is unreleased (the dev->main promote PR is open and blocked); this lands as part of the same release window. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * AI fingerprints rule split + authored-region reconciliation + release-note guards (#2966) * Split AI fingerprints rule into AI1007 (Error) + AI2011 partial (Warning) The single Note-level AI2011 DoNotPersistFingerprints rule flagged both result.fingerprints and result.partialFingerprints with one message. Per the AI rule-id band convention (AI1xxx = MUST/SHALL = Error; AI2xxx = SHOULD = Warning/Note), persisted fingerprints are a hard MUST-NOT while partial fingerprints are advisory. Split accordingly: - AI1007 DoNotPersistFingerprints (Error): result.fingerprints only. - AI2011 DoNotPersistPartialFingerprints (Warning): result.partialFingerprints only. RuleId.AIDoNotPersistFingerprints now maps to AI1007; new RuleId.AIDoNotPersistPartialFingerprints maps to AI2011. Resources, tests, docs, and ReleaseHistory updated. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address PR review: moniker release-note rule ids + region rationale - Rewrite the v5.0.3 BRK entry in enquoted-moniker form (`AI2011.DoNotPersistFingerprints`) and drop implementation detail. - Extend the fingerprinting Rationale to credit region coordinates and region snippets (core + context) as matching signal. - Add ReleaseHistoryStyleTests.ReleaseHistory_RuleIdsUseQuotedMonikers: every rule id must be introduced by its `Id.Name` moniker at least once per entry. 
Remediate all pre-existing debt file-wide with era-correct names (ids get renumbered: old AI1007 = ProvideExploitability, etc.).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Reconcile authored regions via OverwriteExistingData (#1784)

FileRegionsCache previously discarded any divergence between an authored region coordinate and the value computed from the source text (a no-op Assert), silently shipping regions that point at the wrong span. This makes OptionallyEmittedData.OverwriteExistingData govern that reconciliation:

  * UNSET (default, 'validate me'): a divergent authored coordinate (startLine/startColumn/endLine/endColumn/charOffset/charLength), or a character span past end-of-file, throws ArgumentException (paramName 'inputRegion') instead of being trusted blindly.
  * SET ('trust me / recompute'): the divergent authored coordinate is overwritten with the computed value.

This also makes the flag honest: previously it only affected snippets/hashes/contents and never touched region coordinates. Absent (0) coordinates are still computed and filled exactly as before, so callers that omit coordinates are unaffected.

FileRegionsCache.PopulateTextRegionProperties gains an overwriteExistingData parameter (default false); InsertOptionalDataVisitor threads the existing OverwriteExistingData flag into it. emit-finalize does not set OverwriteExistingData, so it validates authored regions for free.
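
The reconciliation rule above can be sketched as follows. This is a minimal Python illustration of the three cases (absent, matching, divergent), restricted to the line/column coordinates; `reconcile_region` and its dictionary-based region shape are hypothetical, and `ValueError` stands in for the C# `ArgumentException`.

```python
# Hypothetical sketch; the real logic lives in the C# FileRegionsCache.
LINE_COLUMN_FIELDS = ("startLine", "startColumn", "endLine", "endColumn")


def reconcile_region(authored, computed, overwrite_existing_data=False):
    """Return a region whose coordinates agree with the computed values.

    authored: coordinates supplied by the producer (0 or missing = absent).
    computed: coordinates derived from the source text.
    """
    reconciled = dict(authored)
    for field in LINE_COLUMN_FIELDS:
        authored_value = authored.get(field, 0)
        computed_value = computed[field]
        if authored_value == 0:
            # Absent coordinate: fill it in, regardless of the flag.
            reconciled[field] = computed_value
        elif authored_value != computed_value:
            if not overwrite_existing_data:
                # Flag unset ("validate me"): refuse the divergent value.
                raise ValueError(
                    f"inputRegion: authored {field}={authored_value} "
                    f"diverges from computed {computed_value}"
                )
            # Flag set ("trust me / recompute"): overwrite with computed.
            reconciled[field] = computed_value
    return reconciled
```

A caller that omits coordinates gets them filled in either mode; only a caller that supplies a wrong coordinate sees a difference between the two modes.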
Closes #1784 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add SDK-owned AI emit-profile input schemas (result + run-header) with C# drift tests (#2967) Add two Draft 2020-12 JSON input schemas the get-schema verb serves for AI SARIF producers, under Sarif.Multitool.Library/GetSchema: * ai-result.schema.json - overlay on the SARIF result that pins the AI profile's ruleId grammar (taxonomy sub-id form plus the exclusive NOVEL- escape via a load-bearing if/then/else), requires message.markdown and a non-empty locations array with region.startLine >= 1, and bounces the index/identity state the multitool assigns at finalize. * ai-run-header.schema.json - overlay on the SARIF run that requires tool.driver.name, a non-empty versionControlProvenance, and properties[ai/origin], canonicalizes URI schemes and GUIDs, and bounces the header fields the replayer ignores. The schemas carry only concise, consumer-facing description annotations; the xUnit drift tests (AIInputSchemaDriftTests, one Fact per contract clause) are the single source of truth pinning each schema verdict to its C# rule. The tests validate with JsonSchema.Net (added to Directory.Packages.props) because the in-repo Microsoft.Json.Schema validator predates Draft 2020-12 and would silently no-op the load-bearing if/then/else routing. Also remove AI1011.RedactedRunMarker (the rule, its RuleId constant, resource strings, tests, and doc rows) - it is out of scope for the AI emit profile. Recorded as a BRK entry under v5.0.3. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add AI emit-profile input schemas for the invocation and reporting-descriptor verbs (#2968) * Add AI emit-profile input schemas for notification, invocation, and descriptor verbs Author SDK-owned Draft 2020-12 input schemas for the remaining AI emit verbs, each pinned to its C# receipt contract by an xUnit drift fact: - ai-notification.schema.json / ai-invocation.schema.json: any JSON object (the verbs validate only "payload is an object" at receipt; richer cross-document checks run at emit-finalize --validate). - ai-reporting-descriptor.schema.json: object requiring a non-empty string id (minLength 1 matches the verb IsNullOrEmpty gate; whitespace id accepted). - ai-rule-descriptor.schema.json: the --rules path, additionally gating id on the NOVEL- prefix (IsNovel), prefix-only by design. All four are standalone (no SARIF base $ref) because the verbs deliberately do not enforce the base required fields (notification.message, invocation.executionSuccessful); $ref'ing the base would over-constrain. Wire the four JSONs into the test csproj copy ItemGroup; add four drift facts (16 total, green). Add a NEW ReleaseHistory entry. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Refactor AI reporting-descriptor schema into base + novel overlay Collapse the two add-reporting-descriptor input schemas to a single base contract plus a minimal overlay, matching the verb's actual shape: there is ONE reportingDescriptor object; the --rules flag only chooses its storage target (rules[] vs notifications[]). - ai-reporting-descriptor.schema.json: the general descriptor contract (object + non-empty string id, mirroring the verb's IsNullOrEmpty gate). - ai-novel-rule-descriptor.schema.json: a minimal overlay that $refs the base and adds the single --rules-path tightening — id must start with NOVEL- (AIRuleIdConvention.IsNovel, prefix-only). 
Composition reads "a novel rule descriptor IS a reporting descriptor whose id starts with NOVEL-", keeping one source of truth for the descriptor shape. Removes the earlier ai-rule-descriptor.schema.json. Drift tests stay at 16 facts; the overlay's inherited non-empty-id rejects confirm its $ref resolves at runtime. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Trim v5.0.3 NEW entry under the 300-char terseness cap Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Remove add-notification; notifications travel inline on atomic add-invocation SARIF has no run-level notifications array, so a flat serial event log cannot route a streamed notification to one of several parallel invocations. Make add-invocation the sole carrier of notifications: the producer holds per-process state and emits ONE fully-formed invocation (with inline toolExecutionNotifications / toolConfigurationNotifications) when the process finishes. Parallelism is then free and the replayer is deterministic. - Remove add-notification verb + AddNotificationCommand/Options + schema/tests. - Remove execution-notification / configuration-notification event kinds; drop the replayer notification buffers, synthetic-invocation attach, and clock reads. - add-invocation now requires executionSuccessful (bool) + non-whitespace commandLine, and auto-stamps endTimeUtc + any unset inline notification timeUtc using a single wall-clock now (producer values preserved). - Enrich ai-invocation.schema.json to a rich overlay; update drift tests. - Migrate CweGenerateSample.ps1 to emit the notification inline on an add-invocation payload; regenerate CweSample.sarif / CweGHAzDoSample.sarif. - Update doc-comments, CweReadme.md, and ReleaseHistory.md. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix ReleaseHistory terseness gate; apply PR review trims - ReleaseHistory.md: trim the add-notification-removal BRK entry under the 300-char terseness gate (ReleaseHistoryStyleTests was red on all 3 platforms); reduce the NEW schema entry to the terse one-sentence form and drop the removed add-notification verb from it (PR review). - ai-reporting-descriptor.schema.json: trim the description and drop the NOVEL- rule sentence (that gate lives in the ai-novel-rule-descriptor overlay) (PR review). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Require workingDirectory on the AI invocation profile Per PR review: the add-invocation overlay now requires a workingDirectory artifactLocation (with a non-whitespace uri), in addition to executionSuccessful and commandLine. The AI producer knows its working directory and it anchors the relative paths in the scan, so the profile surfaces and requires it even though SARIF makes it optional. endTimeUtc stays auto-stamped (not required); we keep startTimeUtc/exitCode permitted-but-optional via the base $ref. - ai-invocation.schema.json: add workingDirectory to required + a property override requiring a non-empty uri; trim the description. - AddInvocationCommand: extend the receipt gate to reject a missing/empty workingDirectory.uri; update doc-comment. - AIInputSchemaDriftTests + AddInvocationCommandTests: pin the new accept/reject cases (missing-workingDirectory, workingDirectory-without-uri, empty uri). - CweGenerateSample.ps1 + regenerated CweSample/CweGHAzDoSample fixtures: the sample invocation now carries workingDirectory. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Make notification timeUtc producer-owned; fix invocation workingDirectory artifact handling Notification timestamps are now producer-owned and required: the add-invocation verb never auto-stamps notification timeUtc. 
It is required by the input schema overlay and enforced by the C# receipt gate. Only endTimeUtc is auto-stamped (receipt ~ process end). Sample now carries a real workingDirectory ({uri, uriBaseId}) plus an arguments[] array to exercise rich invocation structure. Two directory-related bugs fixed (the second was masked by the first): - FileRegionsCache.GetHashData returns null (not the sha-256 of the empty string) for a path resolving to a directory or missing file. - AddFileReferencesVisitor no longer promotes an invocation's workingDirectory into run.artifacts -- it is process context, not a scanned artifact. This removes a location-only artifact (SARIF2004) that surfaced once the bogus empty hash was gone. Regenerated CweSample.sarif and CweGHAzDoSample.sarif: 0 errors/warnings/notes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Tighten AI ruleId grammar to CWE-only with lowercase-kebab sub-ids Per maintainer decision, AI result.ruleId is now CWE-only: CWE-<number>/<sub-id> or NOVEL-<sub-id> where <sub-id> is lowercase-alphanumeric kebab (single hyphens, no leading/trailing/consecutive hyphen). CVE, OWASP, and mixed-case sub-ids are no longer accepted. Because CWE- and NOVEL- prefixes are now disjoint, the result schema's if/then/else ruleId routing is exactly equivalent to a plain anyOf, so it is flattened and the now-false IsNotFlattenableToAnyOf drift test is removed. The descriptor authoring gate stays prefix-only by design. Surfaces synced: AIRuleIdConvention(.Exception), ai-result.schema.json, convention + drift tests (case-for-case), docs, stale prose, and a breaking ReleaseHistory entry. 
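
The tightened ruleId grammar can be approximated with one regular expression. This is a sketch under stated assumptions, not the pattern shipped in ai-result.schema.json or AIRuleIdConvention: `<number>` is assumed to be one or more decimal digits, and `is_acceptable_rule_id` is a hypothetical name.

```python
import re

# Lowercase-alphanumeric kebab: segments joined by single hyphens,
# with no leading, trailing, or consecutive hyphen.
_SUB_ID = r"[a-z0-9]+(?:-[a-z0-9]+)*"

# Assumption: <number> is one or more decimal digits.
_RULE_ID = re.compile(rf"(?:CWE-[0-9]+/|NOVEL-){_SUB_ID}")


def is_acceptable_rule_id(rule_id: str) -> bool:
    """True for CWE-<number>/<sub-id> or NOVEL-<sub-id>."""
    return _RULE_ID.fullmatch(rule_id) is not None
```

Because the `CWE-` and `NOVEL-` prefixes are disjoint, a single alternation is enough here, which mirrors the note that the schema's if/then/else routing flattens to a plain anyOf.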
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review: tighten novel-descriptor gate, scrub stale prose

Resolve the unresolved review threads on the AI emit-schemas work:

- Tighten the add-reporting-descriptor --rules gate from a NOVEL- prefix check to the full novel grammar (IsNovel + IsAcceptable), so a rule descriptor id is byte-identical to the result ruleId that references it. Mirror the tightening in ai-novel-rule-descriptor.schema.json and in the drift / command tests.
- Emit "true"/"false" via a ternary instead of ToString().ToLowerInvariant().
- Drop redundant minLength:1 where paired with pattern \S in ai-invocation.schema.json; correct the workingDirectory uri prose to non-whitespace.
- Add the section 3.49.3 spec ref to the novel-descriptor id description.
- Scrub historical add-notification removal narration from CweGenerateSample.ps1 and the AddInvocationCommand / SarifEventKinds doc comments; restate as positive invariants.
- Remove an obvious one-line comment in SarifEventReplayer.
- Add a BRK ReleaseHistory bullet for the gate tightening.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Simplify emit doc comments to describe current design

Rewrite the invocation/notification doc comments and inline comments across AddInvocationCommand, SarifEventKinds, SarifEventReplayer, and AddReportingDescriptorCommand to state the current, proper design plainly. Remove reverse-mirror justifications (why notifications live inline rather than as a standalone event, why an object carries its properties) and the process-completion-proxy/wall-of-text framing; keep only the load-bearing invariants a reader needs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Clarify emit/AI comments and docs; scrub stale add-notification references

Editorial pass over the AI emit-profile surface.
The changes are comment- and documentation-only: no program behavior, output strings, or public signatures change.

Comments:

- Strip defensive, reverse-mirror, and over-descriptive prose from the Emit verb/option doc comments and SarifEventKinds / SarifEventReplayer so each comment plainly states the current design rather than narrating obvious code, restating what was just said, or justifying why a past mistake was fixed.
- Remove complement-stating clauses (what isn't accepted) from the AI rule-id convention docs once the positive rule is stated.
- Preserve load-bearing intent: the add-invocation receipt contract (producer-supplied timeUtc on inline notifications, auto-stamped endTimeUtc), the FileRegionsCache reconciliation pointer, and the emit-init-run ADO-seeds / GHA-fills-gaps VCP precedence rule.

Docs and skills:

- Drop the removed add-notification verb from multitool-usage.md and rewrite the routing guidance in generating-sarif.md and the emit/validate skills to the current model: notifications travel inline on the add-invocation payload, with placement selected structurally by the toolExecutionNotifications / toolConfigurationNotifications array.

Other:

- Normalize CWE placeholder ids to CWE-NNNN (four digits is the catalog maximum, CWE-1434) wherever a placeholder, not a concrete example, is meant.
- Add .github/prompts/comment-editor.prompt.md, a reusable adversarial comment-editor prompt codifying this editorial standard.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Co-locate endTimeUtc SARIF formatting in StampEndTimeUtcIfOmitted

StampEndTimeUtcIfOmitted now takes the receipt DateTime and performs the SARIF date-string formatting itself, rather than receiving a pre-formatted string from Run(). The format construction lives next to the stamp, so the SARIF-friendly representation is verifiable at the point of use. Output is unchanged.
The producer-supplied endTimeUtc remains a pass-through string: payloads are read with DateParseHandling.None to preserve ISO-8601 text exactly, so only the value the verb itself authors is formatted here.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Trim over-length add-reporting-descriptor release note under 300 chars

ReleaseHistory_CurrentSectionEntriesAreTerse flagged the add-reporting-descriptor --rules bullet at 389 chars. Elide the rejection-mechanics list (bare NOVEL-, slashes, uppercase tails, trailing hyphens), since that detail lives in the API docs and PR, and keep the consumer-facing essentials: the full NOVEL- grammar gate and that the descriptor id is byte-identical to the ruleId that references it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Correct workingDirectory artifacts rationale: no added data, not role

The prior wording justified keeping an invocation's workingDirectory out of run.artifacts on the grounds that it is "process context, not a scanned artifact." That is wrong: SARIF defines a directory artifact role (ArtifactRoles.Directory, 3.24.6), so a directory is a legitimate artifact. The real reason is that the location is bare -- no hashes, contents, or length -- so an artifacts-table entry would carry nothing, which is equally true for process context and scanned artifacts. Reword the code comment and the release note accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Trim add-reporting-descriptor note to the grammar-gate essential

Drop the byte-identical-ruleId follow-on; the grammar-gate change is the consumer-facing fact.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add get-schema verb and align AI emit verb/schema names (#2971)

* Add get-schema verb and align AI emit verb/schema names

Introduce `multitool get-schema <verb>`, which streams the embedded JSON Schema that validates a named emit verb's input to stdout or `--output`, with `--list` to enumerate the servable verbs. The schemas served are the same bytes the emit verbs validate against, so a producer can fetch the contract for the exact verb it is about to call.

Bring the AI emit surface into verb<->schema naming parity:

- Rename `emit-init-run` to `emit-run` (schema `ai-run`).
- Split `add-reporting-descriptor` into `add-notification-reporting-descriptor` and `add-rule-reporting-descriptor`, removing the `--rules` flag; the verb now selects the target array and id gate. Shared body lives in the internal `ReportingDescriptorEmitter`.
- Rename the input schemas to mirror their verbs: `ai-run`, `ai-result`, `ai-invocation`, `ai-notification-reporting-descriptor`, `ai-rule-reporting-descriptor`. The rule schema is now standalone (no cross-schema `allOf`).

`emit-finalize` is reserved in the get-schema catalog with no schema this release; its whole-log `ai-log` schema is the fast-follow tracked in #2970.

Tests: a new in-sync test pins every catalog verb to bytes identical across the embedded resource, the on-disk `GetSchema/` file, and what the verb writes, and asserts catalog<->embedded-manifest parity so the surface cannot silently drift. The drift and descriptor tests are updated for the renames and split. Cwe sample fixtures regenerate byte-identical.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR #2971 review: emit options base, SRCROOT disk check, verb doc-comment wording, SKILL descriptor verbs

- Factor a shared EmitInputOptionsBase (OutputFilePath + InputFilePath) and inherit it from the five emit verb option classes; EmitRunOptions keeps only --force-overwrite.
- emit-run now rejects a file: SRCROOT whose path does not resolve to an existing directory on disk at receipt, so finalize can enrich against an observable checkout. Adds passing/failing receipt tests.
- Correct the command doc-comment summaries to 'Implements <verb>' (the dotnet tool command is 'sarif', not 'multitool').
- SKILL.md: reference add-notification-reporting-descriptor and add-rule-reporting-descriptor in the verb list and add a descriptor step; note the file: SRCROOT disk-existence requirement.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Derive portable artifact roots from versionControlProvenance at emit-finalize (#2973)

* Derive portable artifact roots from versionControlProvenance at emit-finalize

Replace the caller-supplied `--srcroot` rewrite with an EmitFinalizeRebaseVisitor that deconstructs absolute local file paths into relative URIs plus portable, per-repository uriBaseIds anchored at GitHub blob permalinks derived from run.versionControlProvenance. The shipped SARIF carries no machine-specific path.

emit-finalize now hard-requires each run to declare at least one versionControlProvenance entry whose mappedTo.uriBaseId (and only a uriBaseId) binds a declared originalUriBaseIds root with an absolute http(s) repositoryUri and a non-empty revisionId. A single repository collapses to the bare SRCROOT base; multiple repositories each receive SRCROOT_<REPO-LEAF>, disambiguated by an ordinal suffix on collision.
Portable derivation is github.com-only in this release (default port; owner/repo path); other hosts fail with a clear, planned follow-up message. A local path that no declared root resolves fails finalize rather than shipping a machine-specific path. Rebasing runs after enrichment reads sources from the local file:// bases and before serialization; on failure the verb writes each…
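The github.com derivation and the base-id minting rule described in this commit can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the EmitFinalizeRebaseVisitor implementation: the function names github_portable_root and mint_base_ids are hypothetical, upper-casing the repo leaf is an assumption read off the SRCROOT_<REPO-LEAF> placeholder, and the .git stripping and _2/_3 collision suffix follow the PR description.

```python
from typing import List
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def github_portable_root(repository_uri: str, revision_id: str) -> str:
    """Derive a commit-pinned blob permalink root for a github.com repository.

    Error messages deliberately never echo the URI, so a credential embedded
    in it cannot leak into a diagnostic.
    """
    parts = urlsplit(repository_uri)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError("repositoryUri must be an absolute http(s) URI")
    if parts.username or parts.password or parts.query or parts.fragment:
        raise ValueError("repositoryUri must not carry credentials, a query, or a fragment")
    if parts.port not in (None, DEFAULT_PORTS[parts.scheme]):
        raise ValueError("repositoryUri must use the host's default port")
    if (parts.hostname or "").lower() != "github.com":
        raise ValueError("only github.com is handled in this sketch")
    if not revision_id:
        raise ValueError("revisionId must be non-empty")
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 2:
        raise ValueError("expected an <owner>/<repo> path")
    owner, repo = segments
    if repo.endswith(".git"):
        repo = repo[:-4]
    if not repo:
        raise ValueError("repository name must be non-empty")
    return f"https://github.com/{owner}/{repo}/blob/{revision_id}/"


def mint_base_ids(repo_leaves: List[str]) -> List[str]:
    """One repo -> bare SRCROOT; several -> SRCROOT_<LEAF> with _2, _3 on collision."""
    if len(repo_leaves) == 1:
        return ["SRCROOT"]
    seen = {}
    minted = []
    for leaf in repo_leaves:
        base = "SRCROOT_" + leaf.upper()  # assumption: the leaf is upper-cased
        seen[base] = seen.get(base, 0) + 1
        minted.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return minted


if __name__ == "__main__":
    print(github_portable_root("https://github.com/microsoft/sarif-sdk.git", "49208ee"))
    print(mint_base_ids(["app", "lib", "app"]))
```

The trailing slash on the derived root matters: a relative artifact URI resolved against a base that lacks it would replace the last path segment instead of appending to it.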

michaelcfanning and others added 3 commits June 5, 2026 07:44

michaelcfanning marked this pull request as ready for review June 6, 2026 16:01

michaelcfanning requested a review from cfaucon as a code owner June 6, 2026 16:01

michaelcfanning merged commit 49208ee into dev Jun 6, 2026
6 checks passed

michaelcfanning merged 3 commits into dev from ai-emit-vcp-rebase


Summary

Contract (hard-enforced at finalize; FAILURE before any file is written)

emit-run stamping

Generator reconstructs provenance (no hand-patching)

Genchi genbutsu (end-to-end proof)

Validation

Deferred to #2972
