feat(email): expose LLM triage/summary via /v1/email/triage#1547
itomek wants to merge 8 commits into
…1452) Four test classes covering the four acceptance criteria:
- AC1: engine=llm escalates low-confidence heuristic inputs via LLM
- AC2: default engine=heuristic is byte-unchanged, no LLM call
- AC3: cloud base_url raises loudly with no silent fallback
- AC4: enforcement artifact files exist and are importable
…1452) Adds an opt-in ?engine=llm query param to the triage endpoint. When the heuristic confidence is low, the category is escalated via classify_email_llm and the summary is replaced by summarize_email_llm, both calling the LOCAL Lemonade server. The default engine=heuristic path is byte-unchanged. Cloud base_urls raise ConfigurationError loudly (AC3). Also codifies util/check_email_agent_local_only.py as a static lint gate for the local-only contract.
Type the ?engine query param as Literal["heuristic", "llm"] so FastAPI rejects an unknown value with a clean 422 at the API boundary — the partner app must not see a server error for a bad query param. The service-level ValueError guard stays as defense-in-depth for direct callers. Also drops an unused MagicMock import in the local-only integration test.
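A minimal sketch of the service-level guard that commit message describes, assuming the allowed values are exactly the two named there. The function name `validate_engine` is hypothetical, and the FastAPI 422 at the API boundary is not reproduced here; this only illustrates the defense-in-depth `ValueError` for direct callers.

```python
from typing import Literal, get_args

# The two engine names come from the commit message; the alias name is illustrative.
Engine = Literal["heuristic", "llm"]


def validate_engine(engine: str) -> str:
    """Reject unknown engines for callers that bypass the HTTP layer."""
    allowed = get_args(Engine)  # ("heuristic", "llm")
    if engine not in allowed:
        raise ValueError(f"engine must be one of {allowed}, got {engine!r}")
    return engine
```

Deriving the allowed set from the `Literal` keeps the type annotation and the runtime check from drifting apart.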
Review —
Closes #1539

> **Stacked on #1547** (base branch `feat/email-llm-triage-api-1452`). Review after #1547; the diff here is only the `message_id` echo.

## Why this matters

Before: the triage API accepted `message_id` on input but never echoed it in the response, so a consuming app couldn't correlate a result back to its input or dedupe: it would re-triage the same emails on every polling loop.

After: `EmailTriageResult` echoes the input's identifying id (`message.message_id` for a single email, `thread_id` for a thread) across both engines and both single/thread paths. The consumer caches by this id to dedupe. The field is `Optional`, so the frozen #1262 shape stays backward-compatible (existing callers that omit it still validate). Per the AC, consumer correlation is the documented echoed-id route; no stateful server-side cache is added.

## Test plan

- [x] Unit: result echoes `message_id` for single (heuristic + llm) and thread; validates without the field (backward-compat); sample payloads parse. **374 passed.**
- [x] Lint: black + isort pass.
- [x] Real-world (Linux + Windows): the live API echoes the id (evidence below).

## Real-world evidence (live `gaia api start`, engine=heuristic; branch + field verified before testing)

| OS | HEAD | single (`message_id=rw-single-1`) | thread (`thread_id=rw-thread-7`) |
|----|------|-----------------------------------|----------------------------------|
| Linux (t-nx-strx-halo) | 6181908 | result.message_id = `rw-single-1` | result.message_id = `rw-thread-7` |
| Windows (t-win-radeon) | 6181908 | result.message_id = `rw-single-1` | result.message_id = `rw-thread-7` |

Both OSes: single echoes the input `message.message_id`, thread echoes `thread_id`. Boxes restored to their original branches after the run.
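The consumer-side dedupe described above might look roughly like the following. This is a sketch under assumptions: `triage` stands in for whatever client call hits the endpoint, the dict shapes are simplified, and the helper name `triage_new_only` is hypothetical. Only the idea of caching by the echoed `message_id` comes from the description.

```python
from typing import Callable, Dict, Iterable, List


def triage_new_only(
    emails: Iterable[dict],
    triage: Callable[[dict], dict],  # stand-in for the real API call
    cache: Dict[str, dict],
) -> List[dict]:
    """Triage only emails whose id has not been seen in an earlier poll."""
    fresh = []
    for email in emails:
        message_id = email.get("message_id")
        if message_id is not None and message_id in cache:
            continue  # already triaged on a previous polling loop
        result = triage(email)
        echoed = result.get("message_id")  # the field is Optional
        if echoed is not None:
            cache[echoed] = result
        fresh.append(result)
    return fresh
```

Because the field is optional, results without an echoed id are returned but never cached, so they would be re-triaged on the next poll.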
… off the event loop (#1452) The engine=llm path did blocking HTTP I/O to Lemonade on uvicorn's event loop and let LLMTriageError / EmailSummarizeError bubble up as bare 500s. Wrap the synchronous service call with asyncio.to_thread so the loop stays free during model inference; catch both specific failure classes and re-raise as HTTPException(502) with an actionable detail so callers can distinguish an LLM-infra error from a server bug.
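A rough, stdlib-only sketch of the pattern that commit message describes. The exception names `LLMTriageError` and `EmailSummarizeError` come from the message, but the classes below are local stand-ins; `UpstreamError` is a hypothetical substitute for `HTTPException(502)`, and `blocking_triage` is an invented placeholder for the synchronous service call.

```python
import asyncio
import time


class LLMTriageError(Exception):
    """Stand-in for the project's classification failure."""


class EmailSummarizeError(Exception):
    """Stand-in for the project's summarisation failure."""


class UpstreamError(Exception):
    """Hypothetical substitute for HTTPException(status_code=502)."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def blocking_triage(payload: dict) -> dict:
    """Placeholder for the synchronous call that does blocking HTTP I/O."""
    if payload.get("fail"):
        raise LLMTriageError("model returned no parseable JSON")
    time.sleep(0.01)  # simulates time spent waiting on inference
    return {"category": "work", "summary": "placeholder summary"}


async def triage_endpoint(payload: dict) -> dict:
    try:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(blocking_triage, payload)
    except (LLMTriageError, EmailSummarizeError) as exc:
        # Map known LLM-infra failures to a 502 with an actionable detail.
        raise UpstreamError(502, f"Local LLM backend failed: {exc}") from exc
```

Catching only the two specific classes means any other exception still surfaces as a plain server error, which keeps genuine bugs distinguishable from backend outages.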
🟡 Both the class docstring and the module-level doc say:
But

    if heuristic.confident:
        category = EmailCategory(heuristic.category)  # LLM classify skipped
    else:
        llm_result = classify_email_llm(chat, ...)
        category = EmailCategory(llm_result["category"])
    llm_summary = summarize_email_llm(chat, ...)  # ← always called

High-confidence spam/promo emails will incur full LLM summarisation latency even though the heuristic already resolved them. If the intent is "summarise via LLM always when …
…fits (#1452) The API engine=llm path 502'd on every call: _build_llm_chat capped output at max_tokens=512, but Gemma-4 emits a long reasoning preamble before the JSON, so classify_email_llm received truncated output with no parseable object. Caught by real-world testing on a live Lemonade backend (the mocked unit tests could not).
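To illustrate why a token cap produces "no parseable object", here is a small sketch of pulling the first JSON object out of model output that starts with a reasoning preamble. The helper `extract_json_object` is hypothetical and is not claimed to be how `classify_email_llm` parses its input; the sample strings are invented.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Not a complete object at this brace; try the next one.
            start = text.find("{", start + 1)
            continue
        return obj
    return None


full_output = 'Let me think about this email... {"category": "promo", "confidence": 0.9}'
# Simulate an output budget that runs out partway through the JSON.
truncated_output = full_output[: full_output.index("promo")]
```

With the full output the object is recovered; with the truncated output the function returns `None`, which is the failure a caller would then have to surface.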
🟡 Three places in this push add issue numbers inline:
CLAUDE.md is explicit: "Patterns like 'Pre-#1030 follow-up…' belong in the PR description and commit body. Inline they rot as soon as the code moves." The technical content is useful; just strip the |
a176ff0 to 054bebb
Closes #1452

## Why this matters

Before: `POST /v1/email/triage` was heuristic-only; its own docstring said "No LLM is invoked." External consumers got rule-based categorization and a first-two-sentences summary, never the LLM-assisted triage the milestone actually built (#1107, #1266). The public API reached only half of what the agent can do.

After: an opt-in `?engine=llm` runs the heuristic first and, when it's low-confidence, escalates the category via the agent's existing `classify_email_llm` and replaces the summary with `summarize_email_llm`, over the local Lemonade model only. The default `?engine=heuristic` is byte-unchanged: no added latency, no LLM loaded, contract shape preserved. Email content never leaves the machine: cloud providers are hard-pinned off (`use_claude=False`, `use_chatgpt=False`) and a static gate (`util/check_email_agent_local_only.py`) enforces AC3.

## Test plan

- `python -m pytest tests/unit/agents/email/ tests/integration/test_email_agent_local_only.py tests/test_api.py -q`: `engine=llm` returns an LLM category+summary; default `engine=heuristic` is byte-unchanged and makes no LLM call; a cloud `base_url` is rejected loudly; an invalid `engine` value → 422.
- `python util/check_email_agent_local_only.py` exits 0.
- `engine=llm` on a low-confidence sample email returns an LLM-derived category/summary that differs from the heuristic baseline; no cloud egress. (evidence below)

## Real-world evidence

Pending: Linux + Windows runs in progress; logs/screenshots to follow.
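For readers unfamiliar with the "cloud `base_url` is rejected loudly" criterion, one possible shape of such a check is sketched below. It assumes "local" means a loopback host, which the PR does not spell out; `require_local_base_url` is a hypothetical name, `ConfigurationError` is a local stand-in for the project's exception, and the URLs and port are made up.

```python
import ipaddress
from urllib.parse import urlparse


class ConfigurationError(Exception):
    """Stand-in for the project's ConfigurationError."""


def require_local_base_url(base_url: str) -> str:
    """Return base_url unchanged if it points at loopback; raise otherwise."""
    host = urlparse(base_url).hostname
    if host is None:
        raise ConfigurationError(f"base_url has no host: {base_url!r}")
    if host == "localhost":
        return base_url
    try:
        is_loopback = ipaddress.ip_address(host).is_loopback
    except ValueError:
        is_loopback = False  # a DNS name other than localhost
    if not is_loopback:
        # Fail loudly: no silent fallback to another provider.
        raise ConfigurationError(f"refusing non-local base_url: {base_url!r}")
    return base_url
```

Raising instead of falling back means a misconfiguration surfaces immediately rather than quietly sending email content off the machine.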