Add Daily Dedupe Digest workflow #48244

Open
niels9001 wants to merge 1 commit into microsoft:main from niels9001:user/niels9001/dedupe-digest

@niels9001
Collaborator

Summary

Adds .github/workflows/dedupe-digest.yml, a daily scheduled workflow (08:00 UTC + workflow_dispatch) that maintains exactly one open [Dedupe Digest] YYYY-MM-DD issue assigned to @niels9001. Goal: give triagers a single place to manually review duplicate candidates each day.

Behavior

Each run:

  1. Finds all open digests filtered by both the dedupe-digest label AND the title prefix.
  2. Aggregates carry-overs from those digests via stable HTML-comment markers <!-- candidate:NEW=N ORIG=O -->. Drops candidates where:
    • the new issue is now closed, or
    • the new issue is now labeled duplicate, or
    • the suggested original issue is closed/missing/a PR.
  3. Discovers fresh candidates by asking gpt-4o-mini to compare each new issue (since the lookback window) against a pool of recently-updated open issues. Bias the candidate pool toward issues sharing a Product-* label when one exists.
  4. Categorizes each candidate:
    • 🔴 AI-flagged: automatic-issue-deduplication.yml already applied the duplicate label
    • 🟡 Needs review: model returned high/medium confidence but no duplicate label yet
    • 🟢 Low confidence
  5. Creates the new digest issue (titled [Dedupe Digest] YYYY-MM-DD, labeled dedupe-digest, assigned to the configured user).
  6. Closes all prior open digests with a Superseded by #N comment.
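
The carry-over handling in step 2 can be sketched roughly as follows. This is a minimal illustration, not the code from the workflow: the helper names (`parseCarryOvers`, `shouldKeep`) and the shape of the issue lookup map are assumptions, and treating a missing new issue the same as a closed one is also an assumption.

```javascript
// Hypothetical helpers; names and data shapes are assumptions, not taken from the PR.
// Matches the marker format described above: <!-- candidate:NEW=N ORIG=O -->
const MARKER_RE = /<!--\s*candidate:NEW=(\d+)\s+ORIG=(\d+)\s*-->/g;

// Extract { newNumber, origNumber } pairs from the body of an open digest.
function parseCarryOvers(body) {
  const pairs = [];
  for (const m of body.matchAll(MARKER_RE)) {
    pairs.push({ newNumber: Number(m[1]), origNumber: Number(m[2]) });
  }
  return pairs;
}

// issuesByNumber: Map<number, { state: string, labels: string[], isPullRequest: boolean }>
function shouldKeep(pair, issuesByNumber) {
  const fresh = issuesByNumber.get(pair.newNumber);
  const orig = issuesByNumber.get(pair.origNumber);
  // Drop when the new issue is closed (assumed: also when it cannot be found).
  if (!fresh || fresh.state !== 'open') return false;
  // Drop when the new issue already carries the duplicate label.
  if (fresh.labels.includes('duplicate')) return false;
  // Drop when the suggested original is closed, missing, or a PR.
  if (!orig || orig.state !== 'open' || orig.isPullRequest) return false;
  return true;
}
```

Keeping the marker in an HTML comment means it survives in the issue body without being rendered, so the next run can recover the pairs with a single regular expression.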

If no carry-overs and no new candidates: skips creation; an open prior digest (if any) stays open until manually closed.
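
For illustration, the categorization in step 4 and the skip rule above could look something like this. The candidate shape (`labels`, `confidence`) and both function names are hypothetical, and treating any confidence value other than high or medium as low confidence is an assumption.

```javascript
// Hypothetical candidate shape: { labels: string[], confidence: 'high' | 'medium' | 'low' }
function categorize(candidate) {
  // The per-issue dedup workflow already applied the duplicate label.
  if (candidate.labels.includes('duplicate')) return 'ai-flagged';
  // Model was fairly sure, but nothing has labeled the issue yet.
  if (candidate.confidence === 'high' || candidate.confidence === 'medium') {
    return 'needs-review';
  }
  // Assumed: everything else falls into the low-confidence bucket.
  return 'low-confidence';
}

// A new digest is only created when there is something to list.
function shouldCreateDigest(carryOvers, freshCandidates) {
  return carryOvers.length + freshCandidates.length > 0;
}
```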

Coexistence with existing automation

  • Augments, does not replace, automatic-issue-deduplication.yml. That workflow keeps applying the duplicate label per-issue; the digest just surfaces what it did (plus medium/low-confidence candidates that the per-issue action didn't flag).
  • No conflict with Fabric Bot. The 1-day auto-close in resourceManagement.yml fires only on Resolution-Duplicate (set by human /dup), not on the duplicate label applied by the AI action. The digest does not touch either label.
  • Bootstraps the dedupe-digest label idempotently on first run.
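
One way the idempotent label bootstrap might be written is sketched below. It assumes an Octokit-style client such as the `github` object that actions/github-script provides; the function name `ensureLabel`, the color, and the description are placeholders, not values from the PR.

```javascript
// Assumes `github` is an Octokit-style client (e.g. from actions/github-script).
// ensureLabel is a hypothetical name; color and description are placeholders.
async function ensureLabel(github, owner, repo, name) {
  try {
    await github.rest.issues.getLabel({ owner, repo, name });
    return 'exists'; // label already present, nothing to do
  } catch (err) {
    // Only a 404 means "label missing"; anything else is a real failure.
    if (err.status !== 404) throw err;
  }
  await github.rest.issues.createLabel({
    owner,
    repo,
    name,
    color: 'ededed',
    description: 'Daily duplicate-candidate digest',
  });
  return 'created';
}
```

Checking before creating keeps repeated runs from failing once the label exists, which is what makes the bootstrap safe to leave in every scheduled run.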

Configuration (workflow env)

DIGEST_ASSIGNEE: niels9001
DIGEST_LABEL: dedupe-digest
DIGEST_TITLE_PREFIX: "[Dedupe Digest]"
LOOKBACK_HOURS: "26"               # 24h + 2h overlap
MAX_CANDIDATES_PER_DIGEST: "40"
CANDIDATE_POOL_SIZE: "200"
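
Because the numeric settings are quoted strings in the workflow env, the script has to parse them. A minimal sketch of that, with hypothetical helper names (`readConfig`, `lookbackCutoff`) and with the values above assumed as fallbacks:

```javascript
// Hypothetical helper: reads the env block above, falling back to the listed values.
function readConfig(env) {
  const int = (name, fallback) => {
    const n = Number.parseInt(env[name] ?? '', 10);
    return Number.isFinite(n) && n > 0 ? n : fallback;
  };
  return {
    assignee: env.DIGEST_ASSIGNEE ?? 'niels9001',
    label: env.DIGEST_LABEL ?? 'dedupe-digest',
    titlePrefix: env.DIGEST_TITLE_PREFIX ?? '[Dedupe Digest]',
    lookbackHours: int('LOOKBACK_HOURS', 26),
    maxCandidates: int('MAX_CANDIDATES_PER_DIGEST', 40),
    poolSize: int('CANDIDATE_POOL_SIZE', 200),
  };
}

// Issues created at or after this instant count as "new" for the run.
function lookbackCutoff(now, lookbackHours) {
  return new Date(now.getTime() - lookbackHours * 60 * 60 * 1000);
}
```

With the 26-hour window, a run at 08:00 UTC looks back to 06:00 UTC the previous day, so consecutive daily runs overlap by two hours and a late or slow run does not leave a gap.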

Safety

  • Issue titles and model-produced reasons are sanitized before being written to the digest body, so attacker-controlled text cannot inject fake carry-over markers into tomorrow's digest.
  • Model output is post-filtered: every suggested "original" issue number must exist in the candidate pool we sent (PRs and digest issues themselves are pre-excluded).
  • Strict JSON-only parsing of model responses; any malformed reply is ignored, not applied.
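
The three safeguards above could be approximated as in the sketch below. Everything here is an assumption made for illustration: the function names, the escaping strategy, the length cap, and in particular the JSON reply schema (`original`, `confidence`, `reason`), which the PR description does not spell out.

```javascript
// Assumed approach: escape angle brackets so text can never form an HTML comment
// marker, collapse whitespace, and cap the length.
function sanitize(text, maxLen = 200) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/\s+/g, ' ')
    .trim()
    .slice(0, maxLen);
}

// Assumed reply schema: { "original": <issue number>, "confidence": "...", "reason": "..." }
// poolNumbers: Set<number> of issue numbers that were actually sent to the model.
function parseModelReply(raw, poolNumbers) {
  let parsed;
  try {
    parsed = JSON.parse(raw); // strict: any surrounding prose makes this throw
  } catch {
    return null; // malformed reply is ignored, not applied
  }
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) return null;
  const { original, confidence, reason } = parsed;
  // Post-filter: the suggested original must come from the pool we sent.
  if (!Number.isInteger(original) || !poolNumbers.has(original)) return null;
  if (!['high', 'medium', 'low'].includes(confidence)) return null;
  return { original, confidence, reason: sanitize(reason ?? '') };
}
```

Escaping rather than stripping avoids the case where removing one delimiter splices the surrounding characters into a new one.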

Companion PR

The PR auto-labeler is in a separate PR for focused review.

Adds .github/workflows/dedupe-digest.yml: a scheduled daily workflow
(08:00 UTC, plus workflow_dispatch) that maintains a rolling
"[Dedupe Digest] YYYY-MM-DD" issue assigned to niels9001 (configurable
via env) so duplicates can be reviewed manually in one place.

Each run:
- Aggregates carry-overs from all open digests via stable HTML-comment
  markers <!-- candidate:NEW=N ORIG=O -->. Drops candidates whose new or
  original issue is closed, or whose new issue already bears the
  'duplicate' label.
- Discovers fresh candidates by asking gpt-4o-mini to compare each new
  issue against a pool of recently-updated open issues (biased toward
  candidates sharing a Product-* label when present).
- Categorizes candidates as AI-flagged (already labeled 'duplicate' by
  automatic-issue-deduplication.yml), needs-review, or low-confidence.
- Creates the new digest, then comments "Superseded by #N" and closes
  ALL prior open digests.

Bootstraps the dedupe-digest label idempotently on first run.

Augments (does not replace) the existing automatic-issue-deduplication
workflow; coexists with Fabric Bot's Resolution-Duplicate auto-close
(which only fires on human /dup, not the AI-applied 'duplicate' label).

Sanitizes attacker-controlled text (issue titles, model-produced reasons)
to prevent injection of fake carry-over markers into future digests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
niels9001 requested a review from a team as a code owner June 1, 2026 14:57

github-actions Bot commented Jun 1, 2026

@check-spelling-bot Report

🔴 Please review

See the 📂 files view, the 📜action log, 👼 SARIF report, or 📝 job summary for details.

Unrecognized words (2)

dedup
niels

These words are not needed and should be removed: Dedup DWRITE LWIN nonstd VCENTER VREDRAW

To accept these unrecognized words as correct and remove the previously acknowledged and now absent words, you could run the following commands

... in a clone of the git@github.com:niels9001/PowerToys.git repository
on the user/niels9001/dedupe-digest branch (ℹ️ how do I use this?):

curl -s -S -L 'https://raw.githubusercontent.com/check-spelling/check-spelling/cfb6f7e75bbfc89c71eaa30366d0c166f1bd9c8c/apply.pl' |
perl - 'https://github.com/microsoft/PowerToys/actions/runs/26762891586/attempts/1' &&
git commit -m 'Update check-spelling metadata'
Warnings ⚠️ (1)

See the 📂 files view, the 📜action log, 👼 SARIF report, or 📝 job summary for details.

⚠️ Warnings Count
⚠️ duplicate-pattern 2

See ⚠️ Event descriptions for more information.

If the flagged items are 🤯 false positives

If items relate to a ...

  • binary file (or some other file you wouldn't want to check at all).

    Please add a file path to the excludes.txt file matching the containing file.

    File paths are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your files.

    ^ refers to the file's path from the root of the repository, so ^README\.md$ would exclude README.md (on whichever branch you're using).

  • well-formed pattern.

    If you can write a pattern that would match it,
    try adding it to the patterns.txt file.

    Patterns are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your lines.

    Note that patterns can't match multiline strings.

  cancel-in-progress: false

env:
  DIGEST_ASSIGNEE: niels9001

async function askModelForDupes(newIssue, pool) {
  const token = process.env.GITHUB_TOKEN;
  if (!token) {
    console.log('GITHUB_TOKEN is not set; skipping AI dedup.');
niels9001 added the Area-GitHub workflow label Jun 5, 2026
Labels

Area-GitHub workflow Issues regarding the GitHub workflow and automation