feat(email): expose LLM triage/summary via /v1/email/triage by itomek · Pull Request #1547 · amd/gaia

itomek · 2026-06-09T13:55:11Z

Why this matters

Before: POST /v1/email/triage was heuristic-only — its own docstring said "No LLM is invoked." External consumers got rule-based categorization and a first-two-sentences summary, never the LLM-assisted triage the milestone actually built (#1107, #1266). The public API reached only half of what the agent can do.

After: an opt-in ?engine=llm runs the heuristic first and, when it's low-confidence, escalates the category via the agent's existing classify_email_llm and replaces the summary with summarize_email_llm — over the local Lemonade model only. The default ?engine=heuristic is byte-unchanged: no added latency, no LLM loaded, contract shape preserved. Email content never leaves the machine — cloud providers are hard-pinned off (use_claude=False, use_chatgpt=False) and a static gate (util/check_email_agent_local_only.py) enforces AC3.

Test plan

Unit/API (mocked chat): python -m pytest tests/unit/agents/email/ tests/integration/test_email_agent_local_only.py tests/test_api.py -q — engine=llm returns an LLM category+summary; default engine=heuristic is byte-unchanged and makes no LLM call; a cloud base_url is rejected loudly; an invalid engine value → 422.
AC3 static gate: python util/check_email_agent_local_only.py exits 0.
Real-world (Linux, AMD): API server + live Lemonade — engine=llm on a low-confidence sample email returns an LLM-derived category/summary that differs from the heuristic baseline; no cloud egress. (evidence below)
Real-world (Windows, AMD): same. (evidence below)

Real-world evidence

Pending — Linux + Windows runs in progress; logs/screenshots to follow.

…1452) Four test classes covering the four acceptance criteria: - AC1: engine=llm escalates low-confidence heuristic inputs via LLM - AC2: default engine=heuristic is byte-unchanged, no LLM call - AC3: cloud base_url raises loudly with no silent fallback - AC4: enforcement artifact files exist and are importable

…1452) Adds an opt-in ?engine=llm query param to the triage endpoint. When the heuristic confidence is low, the category is escalated via classify_email_llm and the summary is replaced by summarize_email_llm, both calling the LOCAL Lemonade server. The default engine=heuristic path is byte-unchanged. Cloud base_urls raise ConfigurationError loudly (AC3). Also codifies util/check_email_agent_local_only.py as a static lint gate for the local-only contract.

Type the ?engine query param as Literal["heuristic", "llm"] so FastAPI rejects an unknown value with a clean 422 at the API boundary — the partner app must not see a server error for a bad query param. The service-level ValueError guard stays as defense-in-depth for direct callers. Also drops an unused MagicMock import in the local-only integration test.

github-actions · 2026-06-10T14:47:52Z

Review — `feat(email): expose LLM triage/summary via /v1/email/triage`

Summary

This is a clean, well-scoped opt-in: ?engine=llm escalates only low-confidence heuristic results through the local model, and the default engine=heuristic path is genuinely untouched (verified by a byte-for-byte baseline test). Reuse is excellent — it threads the agent's existing classify_email_llm / summarize_email_llm / heuristic categorizer and reuses _build_draft + _extract_action_items rather than duplicating logic, so the LLM path mirrors the heuristic path exactly.

The one thing worth your attention before this can be cited as "AC3-enforced": the two dedicated enforcement artifacts don't actually run anywhere. The static gate (util/check_email_agent_local_only.py) is never invoked by CI — lint.yml only runs lint.py --all; the util/check_*.py path filter just re-triggers the lint job, it doesn't execute the script (contrast check_doc_links.py / check_doc_versions.py, which are explicitly run). And the new egress integration test isn't picked up by any workflow (integration jobs are file-scoped and this file isn't listed). The runtime guarantee still holds — cloud providers are hard-pinned off (use_claude=False, use_chatgpt=False) and the cloud-base_url rejection is unit-tested and runs via test_unit.yml — so this is an enforcement-by-test gap, not a security hole. But the PR sells the static gate as the AC3 mechanism, and right now it's documentation.

Issues

🟡 Important — AC3 enforcement artifacts are inert (not executed)
The static gate and the egress integration test are only asserted to exist / be importable; nothing runs them. test_static_lint_gate_module_is_importable (tests/unit/agents/email/test_llm_engine_triage.py:283) loads the module and checks for a callable, but never calls check()/main() nor asserts a 0 exit. So a future change that introduces a cloud import or a use_claude field would pass CI green.

Cheapest fix — make the existing unit test actually run the gate:

    def test_static_lint_gate_passes(self):
        import importlib.util

        spec = importlib.util.spec_from_file_location(
            "check_email_agent_local_only",
            _REPO_ROOT / "util" / "check_email_agent_local_only.py",
        )
        mod = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(mod)  # type: ignore[union-attr]
        assert mod.check() == 0, "AC3 static gate reported a violation"

For the egress integration test, either add a CI step that runs it or fold the egress-detector assertion into the unit file (it already mocks the chat client, so it needs no live backend and can live under tests/unit/). Without one of these, "a static gate enforces AC3" overstates what's wired up.

🟡 Important — engine=llm LLM failures surface as a bare 500 (src/gaia/api/email_routes.py:338)
classify_email_llm / summarize_email_llm raise LLMTriageError / EmailSummarizeError on transport failure, and _build_result_llm lets them propagate uncaught to the endpoint — so a transient local-Lemonade hiccup returns an opaque 500 Internal Server Error to the partner app. This file already establishes the right pattern at resolve_*_backend (email_routes.py:588): translate the boundary exception into a structured 503 with an actionable detail. Wrapping the engine=llm call the same way (e.g. 503 "local LLM unavailable — is Lemonade running?") keeps it loud and actionable, instead of loud-but-opaque. Comment-only — there are valid choices on the exact status code, so I'd rather you decide than hand you a diff.

🟢 Minor — high-confidence spam still pays an LLM summarize round-trip (email_routes.py:354)
When the heuristic is confident (spam/promotions), classify is correctly skipped — but summarize_email_llm is always called regardless. That matches the PR's "replaces the summary" wording, so it may be intentional; flagging only so you confirm you want an LLM call for mail that's already confidently spam.

🟢 Minor — misleading comment on max_tokens (email_routes.py:137)
max_tokens caps generation length, not prompt capacity, so "Keep tokens generous enough for the triage system prompt + body" reads backwards. Suggest: # Output budget for the triage summary; the prompt+body are input, not capped here.

🟢 Minor — _build_llm_chat base_url validation is a no-op on the live path (email_routes.py:117)
The endpoint never lets a caller set base_url, so production always calls _build_llm_chat() with base_url=None, and EmailAgentConfig(base_url=None).validate() short-circuits (config.py:127 if self.base_url:). The real HTTP-surface protection is the hard-pinned use_claude/use_chatgpt=False; the validate() call only guards direct internal callers and the tests. Fine as defense-in-depth — just noting the allowlist check isn't what protects the REST path, in case the comment ("validated against the AC3 allowlist at call time") gives a false sense of where the guarantee comes from.

Strengths

Default path is provably unchanged — test_heuristic_result_matches_baseline asserts category/spam/phishing/summary equality against a pre-change run, which is exactly the right way to prove "no added latency, no LLM loaded."
Real reuse, no duplication — the LLM path delegates to the agent's existing classify_email_llm / summarize_email_llm and shares _build_draft / _extract_action_items, and mirrors the heuristic single/thread structure (same label_ids=[], same summary_prefix) one-for-one.
Layered AC3 posture — hard-pinned cloud flags + base_url allowlist + an AST check for module-level cloud imports + an egress detector that raises on any real HTTP. Even with the wiring gap above, the intent and the runtime behavior are sound, and the cloud-base_url-rejection test is parametrized across all four cloud hosts.
Clean 422 semantics — using a FastAPI Literal["heuristic","llm"] query param means an unknown engine is a validation error, not a 500, and the service-level ValueError is a sensible belt-and-suspenders for direct callers.

Verdict

Approve with suggestions. No correctness or security blocker — the local-only guarantee holds at runtime and is unit-tested in CI. The two 🟡 items are worth doing before merge: wire the static gate / egress test so AC3 is enforced rather than documented, and translate engine=llm LLM failures into a structured boundary error matching the existing 503 pattern in this file. The 🟢 items are quick cleanups.

Closes #1539 > **Stacked on #1547** (base branch `feat/email-llm-triage-api-1452`). Review after #1547 — the diff here is only the `message_id` echo. ## Why this matters Before: the triage API accepted `message_id` on input but never echoed it in the response, so a consuming app couldn't correlate a result back to its input or dedupe — it would re-triage the same emails on every polling loop. After: `EmailTriageResult` echoes the input's identifying id — `message.message_id` for a single email, `thread_id` for a thread — across both engines and both single/thread paths. The consumer caches by this id to dedupe. The field is `Optional`, so the frozen #1262 shape stays backward-compatible (existing callers that omit it still validate). Per the AC, consumer correlation is the documented echoed-id route; no stateful server-side cache is added. ## Test plan - [x] Unit: result echoes `message_id` for single (heuristic + llm) and thread; validates without the field (backward-compat); sample payloads parse. **374 passed.** - [x] Lint: black + isort pass. - [x] Real-world (Linux + Windows): the live API echoes the id — evidence below. ## Real-world evidence (live `gaia api start`, engine=heuristic; branch + field verified before testing) | OS | HEAD | single (`message_id=rw-single-1`) | thread (`thread_id=rw-thread-7`) | |----|------|-----------------------------------|----------------------------------| | Linux — t-nx-strx-halo | 6181908 | result.message_id = `rw-single-1` | result.message_id = `rw-thread-7` | | Windows — t-win-radeon | 6181908 | result.message_id = `rw-single-1` | result.message_id = `rw-thread-7` | Both OSes: single echoes the input `message.message_id`, thread echoes `thread_id`. Boxes restored to their original branches after the run.

… off the event loop (#1452) The engine=llm path did blocking HTTP I/O to Lemonade on uvicorn's event loop and let LLMTriageError / EmailSummarizeError bubble up as bare 500s. Wrap the synchronous service call with asyncio.to_thread so the loop stays free during model inference; catch both specific failure classes and re-raise as HTTPException(502) with an actionable detail so callers can distinguish an LLM-infra error from a server bug.

github-actions · 2026-06-10T15:31:19Z

🟡 email_routes.py:302 — summarize_email_llm is called unconditionally, contradicting the documented contract.

Both the class docstring and the module-level doc say:

"when [heuristic] returns confident=False, classify_email_llm and summarize_email_llm are invoked"

But _build_result_llm only guards the classify call:

if heuristic.confident:
    category = EmailCategory(heuristic.category)   # LLM classify skipped
else:
    llm_result = classify_email_llm(chat, ...)
    category = EmailCategory(llm_result["category"])

llm_summary = summarize_email_llm(chat, ...)       # ← always called

High-confidence spam/promo emails will incur full LLM summarisation latency even though the heuristic already resolved them. test_high_confidence_heuristic_skips_llm_classify only asserts "classify" not in chat.calls and never checks whether "summarize" was correctly skipped — so this gap is untested.

If the intent is "summarise via LLM always when engine=llm", fix the docstring to say so. If the intent is to skip both on high-confidence results, guard the summarise call the same way:

        if heuristic.confident:
            category = EmailCategory(heuristic.category)
            llm_summary = self._summarize(subject, body)  # heuristic path
        else:
            llm_result = classify_email_llm(chat, subject=subject, sender=sender_raw, body=body)
            category = EmailCategory(llm_result["category"])
            llm_summary = summarize_email_llm(chat, subject=subject, sender=sender_raw, body=body)

…fits (#1452) The API engine=llm path 502'd on every call: _build_llm_chat capped output at max_tokens=512, but Gemma-4 emits a long reasoning preamble before the JSON, so classify_email_llm received truncated output with no parseable object. Caught by real-world testing on a live Lemonade backend (the mocked unit tests could not).

github-actions · 2026-06-10T18:48:42Z

🟡 contract.py:238, email_routes.py — issue-number references in docstrings violate CLAUDE.md

Three places in this push add issue numbers inline:

contract.py:238 — "Amendment (#1539): …" and "frozen #1262 shape" in the class docstring
email_routes.py module docstring — "Engine selection (#1452)"
triage_request Args section — "The frozen #1262 contract request."

CLAUDE.md is explicit: "Patterns like 'Pre-#1030 follow-up…' belong in the PR description and commit body. Inline they rot as soon as the code moves." The technical content is useful; just strip the (#NNNN) tokens and the history clause.

class EmailTriageResult(_Strict):
    """The structured analysis of a single email or thread.

    ``message_id`` is optional and echoes the identifying id of the input —
    ``message.message_id`` for a single email, ``thread_id`` for a thread.
    The field is ``Optional`` for backward compatibility: existing callers
    that omit it continue to validate without change.
    """

itomek added 3 commits June 9, 2026 09:40

github-actions Bot added the tests Test changes label Jun 9, 2026

github-advanced-security AI found potential problems Jun 9, 2026

View reviewed changes

Comment thread tests/unit/agents/email/test_llm_engine_triage.py Fixed

test(email): assert AC3 error markers instead of host substring (CodeQL)

2a0f313

This was referenced Jun 9, 2026

feat(email): echo input message_id in triage result #1551

Merged

docs(email): Milestone 40 (v0.21.0) execution-order spec (sequencing + rationale) #1319

Open

itomek marked this pull request as ready for review June 10, 2026 14:41

itomek requested a review from kovtcharov-amd as a code owner June 10, 2026 14:41

itomek added 2 commits June 10, 2026 11:13

github-actions Bot added the agents label Jun 10, 2026

itomek force-pushed the feat/email-llm-triage-api-1452 branch from a176ff0 to 054bebb Compare June 10, 2026 19:20

Merge branch 'main' into feat/email-llm-triage-api-1452

f773bf2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(email): expose LLM triage/summary via /v1/email/triage#1547

feat(email): expose LLM triage/summary via /v1/email/triage#1547
itomek wants to merge 8 commits into
mainfrom
feat/email-llm-triage-api-1452

itomek commented Jun 9, 2026

Uh oh!

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

itomek commented Jun 9, 2026

Why this matters

Test plan

Real-world evidence

Uh oh!

Uh oh!

github-actions Bot commented Jun 10, 2026

Review — feat(email): expose LLM triage/summary via /v1/email/triage

Summary

Issues

Strengths

Verdict

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Review — `feat(email): expose LLM triage/summary via /v1/email/triage`