Skip to content

feat(email): expose LLM triage/summary via /v1/email/triage#1547

Open
itomek wants to merge 8 commits into
mainfrom
feat/email-llm-triage-api-1452
Open

feat(email): expose LLM triage/summary via /v1/email/triage#1547
itomek wants to merge 8 commits into
mainfrom
feat/email-llm-triage-api-1452

Conversation

@itomek

@itomek itomek commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Closes #1452

Why this matters

Before: POST /v1/email/triage was heuristic-only — its own docstring said "No LLM is invoked." External consumers got rule-based categorization and a first-two-sentences summary, never the LLM-assisted triage the milestone actually built (#1107, #1266). The public API reached only half of what the agent can do.

After: an opt-in ?engine=llm runs the heuristic first and, when it's low-confidence, escalates the category via the agent's existing classify_email_llm and replaces the summary with summarize_email_llm — over the local Lemonade model only. The default ?engine=heuristic is byte-unchanged: no added latency, no LLM loaded, contract shape preserved. Email content never leaves the machine — cloud providers are hard-pinned off (use_claude=False, use_chatgpt=False) and a static gate (util/check_email_agent_local_only.py) enforces AC3.

Test plan

  • Unit/API (mocked chat): python -m pytest tests/unit/agents/email/ tests/integration/test_email_agent_local_only.py tests/test_api.py -qengine=llm returns an LLM category+summary; default engine=heuristic is byte-unchanged and makes no LLM call; a cloud base_url is rejected loudly; an invalid engine value → 422.
  • AC3 static gate: python util/check_email_agent_local_only.py exits 0.
  • Real-world (Linux, AMD): API server + live Lemonade — engine=llm on a low-confidence sample email returns an LLM-derived category/summary that differs from the heuristic baseline; no cloud egress. (evidence below)
  • Real-world (Windows, AMD): same. (evidence below)

Real-world evidence

Pending — Linux + Windows runs in progress; logs/screenshots to follow.

itomek added 3 commits June 9, 2026 09:40
…1452)

Four test classes covering the four acceptance criteria:
- AC1: engine=llm escalates low-confidence heuristic inputs via LLM
- AC2: default engine=heuristic is byte-unchanged, no LLM call
- AC3: cloud base_url raises loudly with no silent fallback
- AC4: enforcement artifact files exist and are importable
…1452)

Adds an opt-in ?engine=llm query param to the triage endpoint. When the
heuristic confidence is low, the category is escalated via
classify_email_llm and the summary is replaced by summarize_email_llm,
both calling the LOCAL Lemonade server. The default engine=heuristic
path is byte-unchanged. Cloud base_urls raise ConfigurationError loudly
(AC3). Also codifies util/check_email_agent_local_only.py as a static
lint gate for the local-only contract.
Type the ?engine query param as Literal["heuristic", "llm"] so FastAPI
rejects an unknown value with a clean 422 at the API boundary — the
partner app must not see a server error for a bad query param. The
service-level ValueError guard stays as defense-in-depth for direct
callers. Also drops an unused MagicMock import in the local-only
integration test.
@github-actions github-actions Bot added the tests Test changes label Jun 9, 2026
Comment thread tests/unit/agents/email/test_llm_engine_triage.py Fixed
@github-actions

Copy link
Copy Markdown
Contributor

Review — feat(email): expose LLM triage/summary via /v1/email/triage

Summary

This is a clean, well-scoped opt-in: ?engine=llm escalates only low-confidence heuristic results through the local model, and the default engine=heuristic path is genuinely untouched (verified by a byte-for-byte baseline test). Reuse is excellent — it threads the agent's existing classify_email_llm / summarize_email_llm / heuristic categorizer and reuses _build_draft + _extract_action_items rather than duplicating logic, so the LLM path mirrors the heuristic path exactly.

The one thing worth your attention before this can be cited as "AC3-enforced": the two dedicated enforcement artifacts don't actually run anywhere. The static gate (util/check_email_agent_local_only.py) is never invoked by CI — lint.yml only runs lint.py --all; the util/check_*.py path filter just re-triggers the lint job, it doesn't execute the script (contrast check_doc_links.py / check_doc_versions.py, which are explicitly run). And the new egress integration test isn't picked up by any workflow (integration jobs are file-scoped and this file isn't listed). The runtime guarantee still holds — cloud providers are hard-pinned off (use_claude=False, use_chatgpt=False) and the cloud-base_url rejection is unit-tested and runs via test_unit.yml — so this is an enforcement-by-test gap, not a security hole. But the PR sells the static gate as the AC3 mechanism, and right now it's documentation.

Issues

🟡 Important — AC3 enforcement artifacts are inert (not executed)
The static gate and the egress integration test are only asserted to exist / be importable; nothing runs them. test_static_lint_gate_module_is_importable (tests/unit/agents/email/test_llm_engine_triage.py:283) loads the module and checks for a callable, but never calls check()/main() nor asserts a 0 exit. So a future change that introduces a cloud import or a use_claude field would pass CI green.

Cheapest fix — make the existing unit test actually run the gate:

    def test_static_lint_gate_passes(self):
        import importlib.util

        spec = importlib.util.spec_from_file_location(
            "check_email_agent_local_only",
            _REPO_ROOT / "util" / "check_email_agent_local_only.py",
        )
        mod = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(mod)  # type: ignore[union-attr]
        assert mod.check() == 0, "AC3 static gate reported a violation"

For the egress integration test, either add a CI step that runs it or fold the egress-detector assertion into the unit file (it already mocks the chat client, so it needs no live backend and can live under tests/unit/). Without one of these, "a static gate enforces AC3" overstates what's wired up.

🟡 Important — engine=llm LLM failures surface as a bare 500 (src/gaia/api/email_routes.py:338)
classify_email_llm / summarize_email_llm raise LLMTriageError / EmailSummarizeError on transport failure, and _build_result_llm lets them propagate uncaught to the endpoint — so a transient local-Lemonade hiccup returns an opaque 500 Internal Server Error to the partner app. This file already establishes the right pattern at resolve_*_backend (email_routes.py:588): translate the boundary exception into a structured 503 with an actionable detail. Wrapping the engine=llm call the same way (e.g. 503 "local LLM unavailable — is Lemonade running?") keeps it loud and actionable, instead of loud-but-opaque. Comment-only — there are valid choices on the exact status code, so I'd rather you decide than hand you a diff.

🟢 Minor — high-confidence spam still pays an LLM summarize round-trip (email_routes.py:354)
When the heuristic is confident (spam/promotions), classify is correctly skipped — but summarize_email_llm is always called regardless. That matches the PR's "replaces the summary" wording, so it may be intentional; flagging only so you confirm you want an LLM call for mail that's already confidently spam.

🟢 Minor — misleading comment on max_tokens (email_routes.py:137)
max_tokens caps generation length, not prompt capacity, so "Keep tokens generous enough for the triage system prompt + body" reads backwards. Suggest: # Output budget for the triage summary; the prompt+body are input, not capped here.

🟢 Minor — _build_llm_chat base_url validation is a no-op on the live path (email_routes.py:117)
The endpoint never lets a caller set base_url, so production always calls _build_llm_chat() with base_url=None, and EmailAgentConfig(base_url=None).validate() short-circuits (config.py:127 if self.base_url:). The real HTTP-surface protection is the hard-pinned use_claude/use_chatgpt=False; the validate() call only guards direct internal callers and the tests. Fine as defense-in-depth — just noting the allowlist check isn't what protects the REST path, in case the comment ("validated against the AC3 allowlist at call time") gives a false sense of where the guarantee comes from.

Strengths

  • Default path is provably unchangedtest_heuristic_result_matches_baseline asserts category/spam/phishing/summary equality against a pre-change run, which is exactly the right way to prove "no added latency, no LLM loaded."
  • Real reuse, no duplication — the LLM path delegates to the agent's existing classify_email_llm / summarize_email_llm and shares _build_draft / _extract_action_items, and mirrors the heuristic single/thread structure (same label_ids=[], same summary_prefix) one-for-one.
  • Layered AC3 posture — hard-pinned cloud flags + base_url allowlist + an AST check for module-level cloud imports + an egress detector that raises on any real HTTP. Even with the wiring gap above, the intent and the runtime behavior are sound, and the cloud-base_url-rejection test is parametrized across all four cloud hosts.
  • Clean 422 semantics — using a FastAPI Literal["heuristic","llm"] query param means an unknown engine is a validation error, not a 500, and the service-level ValueError is a sensible belt-and-suspenders for direct callers.

Verdict

Approve with suggestions. No correctness or security blocker — the local-only guarantee holds at runtime and is unit-tested in CI. The two 🟡 items are worth doing before merge: wire the static gate / egress test so AC3 is enforced rather than documented, and translate engine=llm LLM failures into a structured boundary error matching the existing 503 pattern in this file. The 🟢 items are quick cleanups.

itomek added 2 commits June 10, 2026 11:13
Closes #1539

> **Stacked on #1547** (base branch `feat/email-llm-triage-api-1452`).
Review after #1547 — the diff here is only the `message_id` echo.

## Why this matters
Before: the triage API accepted `message_id` on input but never echoed
it in the response, so a consuming app couldn't correlate a result back
to its input or dedupe — it would re-triage the same emails on every
polling loop.

After: `EmailTriageResult` echoes the input's identifying id —
`message.message_id` for a single email, `thread_id` for a thread —
across both engines and both single/thread paths. The consumer caches by
this id to dedupe. The field is `Optional`, so the frozen #1262 shape
stays backward-compatible (existing callers that omit it still
validate). Per the AC, consumer correlation is the documented echoed-id
route; no stateful server-side cache is added.

## Test plan
- [x] Unit: result echoes `message_id` for single (heuristic + llm) and
thread; validates without the field (backward-compat); sample payloads
parse. **374 passed.**
- [x] Lint: black + isort pass.
- [x] Real-world (Linux + Windows): the live API echoes the id —
evidence below.

## Real-world evidence (live `gaia api start`, engine=heuristic; branch
+ field verified before testing)
| OS | HEAD | single (`message_id=rw-single-1`) | thread
(`thread_id=rw-thread-7`) |

|----|------|-----------------------------------|----------------------------------|
| Linux — t-nx-strx-halo | 6181908 | result.message_id = `rw-single-1`
| result.message_id = `rw-thread-7` |
| Windows — t-win-radeon | 6181908 | result.message_id = `rw-single-1`
| result.message_id = `rw-thread-7` |

Both OSes: single echoes the input `message.message_id`, thread echoes
`thread_id`. Boxes restored to their original branches after the run.
… off the event loop (#1452)

The engine=llm path did blocking HTTP I/O to Lemonade on uvicorn's event
loop and let LLMTriageError / EmailSummarizeError bubble up as bare 500s.

Wrap the synchronous service call with asyncio.to_thread so the loop stays
free during model inference; catch both specific failure classes and re-raise
as HTTPException(502) with an actionable detail so callers can distinguish
an LLM-infra error from a server bug.
@github-actions

Copy link
Copy Markdown
Contributor

🟡 email_routes.py:302summarize_email_llm is called unconditionally, contradicting the documented contract.

Both the class docstring and the module-level doc say:

"when [heuristic] returns confident=False, classify_email_llm and summarize_email_llm are invoked"

But _build_result_llm only guards the classify call:

if heuristic.confident:
    category = EmailCategory(heuristic.category)   # LLM classify skipped
else:
    llm_result = classify_email_llm(chat, ...)
    category = EmailCategory(llm_result["category"])

llm_summary = summarize_email_llm(chat, ...)       # ← always called

High-confidence spam/promo emails will incur full LLM summarisation latency even though the heuristic already resolved them. test_high_confidence_heuristic_skips_llm_classify only asserts "classify" not in chat.calls and never checks whether "summarize" was correctly skipped — so this gap is untested.

If the intent is "summarise via LLM always when engine=llm", fix the docstring to say so. If the intent is to skip both on high-confidence results, guard the summarise call the same way:

        if heuristic.confident:
            category = EmailCategory(heuristic.category)
            llm_summary = self._summarize(subject, body)  # heuristic path
        else:
            llm_result = classify_email_llm(chat, subject=subject, sender=sender_raw, body=body)
            category = EmailCategory(llm_result["category"])
            llm_summary = summarize_email_llm(chat, subject=subject, sender=sender_raw, body=body)

…fits (#1452)

The API engine=llm path 502'd on every call: _build_llm_chat capped output at
max_tokens=512, but Gemma-4 emits a long reasoning preamble before the JSON, so
classify_email_llm received truncated output with no parseable object. Caught by
real-world testing on a live Lemonade backend (the mocked unit tests could not).
@github-actions

Copy link
Copy Markdown
Contributor

🟡 contract.py:238, email_routes.py — issue-number references in docstrings violate CLAUDE.md

Three places in this push add issue numbers inline:

  • contract.py:238"Amendment (#1539): …" and "frozen #1262 shape" in the class docstring
  • email_routes.py module docstring — "Engine selection (#1452)"
  • triage_request Args section — "The frozen #1262 contract request."

CLAUDE.md is explicit: "Patterns like 'Pre-#1030 follow-up…' belong in the PR description and commit body. Inline they rot as soon as the code moves." The technical content is useful; just strip the (#NNNN) tokens and the history clause.

class EmailTriageResult(_Strict):
    """The structured analysis of a single email or thread.

    ``message_id`` is optional and echoes the identifying id of the input —
    ``message.message_id`` for a single email, ``thread_id`` for a thread.
    The field is ``Optional`` for backward compatibility: existing callers
    that omit it continue to validate without change.
    """

@itomek itomek force-pushed the feat/email-llm-triage-api-1452 branch from a176ff0 to 054bebb Compare June 10, 2026 19:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(email): expose the agent's LLM triage/summary via /v1/email/triage (currently heuristic-only)

3 participants