docs(benchmark): write-up, charts + QA/translation tooling (split from #429) #430
SantiagoDePolonia wants to merge 3 commits into
The narrative and visuals for the June 2026 AWS gateway benchmark (ARTICLE.md,
cover.png + scripts/make_cover.py, charts/), plus two tools that are co-located in
the benchmark folder but are separate from the perf benchmark itself:
- qa/: a declarative quality/correctness suite (53 cases across dialects
  and modalities, run against real providers through a gateway)
- translation/: a recording-mock harness comparing how each gateway translates the
  same request
Split out from the benchmark PR (#429) so the core benchmark stays focused.
Opened as a draft pending a decision on whether/where this belongs in-repo.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Add ARTICLE2.md, the measured "Benchmarking AI Gateways" variant of the benchmark write-up, alongside the existing ARTICLE.md, plus its cover (cover-b.png) and generator (make_cover_b.py). Reuses the shared charts and cover.png already in this PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
QA suite: isolate per-case errors (evaluate() now inside the try) and
support ${var} interpolation in expect blocks; assert conversation object
identity (get/update/delete/use_in_responses), batch-embedding ordering,
and a streaming usage record; drop non-primary "green" from the colors
oracle; coerce contains/not_contains operands to str; guard report
modality against non-list values.
Translation tooling: fail fast on a failed mock reset, reject unknown
--gateways values, pin peer gateway images by digest, escape AI-authored
Markdown cells, fix the GoModel port and a fenced-block language in the
README.
Write-up: clarify GoModel's open-source table cell ("Yes ‡").
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
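For reviewers who want a concrete picture of the QA-suite changes in the commit above, here is a minimal sketch of three of them: ${var} interpolation in expect blocks, coercing contains/not_contains operands to str, and keeping evaluate() inside the per-case try. It is not the code in qa/. The names interpolate, contains, run_case, and the case dictionary layout are assumptions made for illustration; only evaluate() and the expect/contains/not_contains vocabulary come from the commit message.

```python
import re
from typing import Any, Callable

# Matches ${name}; the exact placeholder grammar used by qa/ is an assumption.
_VAR = re.compile(r"\$\{(\w+)\}")


def interpolate(value: Any, variables: dict) -> Any:
    """Recursively substitute ${name} in every string of an expect block."""
    if isinstance(value, str):
        # Unknown names are left untouched so the mismatch shows up in the report.
        return _VAR.sub(
            lambda m: str(variables.get(m.group(1), m.group(0))), value
        )
    if isinstance(value, dict):
        return {k: interpolate(v, variables) for k, v in value.items()}
    if isinstance(value, list):
        return [interpolate(v, variables) for v in value]
    return value


def contains(haystack: Any, needle: Any) -> bool:
    """Substring check with both operands coerced to str."""
    return str(needle) in str(haystack)


def evaluate(case: dict, response: Any, variables: dict) -> list:
    """Return a list of failure messages; an empty list means the case passed."""
    expect = interpolate(case.get("expect", {}), variables)
    failures = []
    for needle in expect.get("contains", []):
        if not contains(response, needle):
            failures.append(f"missing {needle!r}")
    for needle in expect.get("not_contains", []):
        if contains(response, needle):
            failures.append(f"unexpected {needle!r}")
    return failures


def run_case(case: dict, call: Callable[[dict], Any], variables: dict) -> dict:
    """Run one case; an exception in the call OR in evaluate() stays local."""
    try:
        response = call(case)
        failures = evaluate(case, response, variables)  # inside the try
    except Exception as exc:
        return {"name": case.get("name"), "status": "error", "detail": repr(exc)}
    status = "pass" if not failures else "fail"
    return {"name": case.get("name"), "status": status, "detail": failures}
```

With this shape, a malformed expect block (for example a string where a mapping is expected) makes that one case report "error" instead of aborting the whole run, and a numeric operand such as a conversation ID captured earlier can be compared against response text without a TypeError.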
Draft - split out of #429 so the core benchmark PR stays focused. Decide separately whether/where this belongs in-repo.
Contains the parts of docs/2026-06-25_aws_gateway_benchmark/ that aren't the perf benchmark itself:
- ARTICLE.md (the blog narrative; note it duplicates the enterpilot.io post and will drift), cover.png + scripts/make_cover.py, and the four SVG charts/.
- qa/ - a declarative quality/correctness suite (53 cases across chat / responses / messages, streaming + non-streaming, plus audio/embeddings), run against real providers through a gateway.
- translation/ - a recording-mock harness that compares how GoModel, LiteLLM, Portkey, and Bifrost translate the same request.

The reproducible perf benchmark (harness, RESULTS.md) and the refreshed docs/about/benchmarks.mdx are in #429.

🤖 Generated with Claude Code