Benchmarking AI observability using a minimal RAG application. Every experiment runs the same app with different instrumentation, so you can compare what each approach captures.
graph LR
User --> experiments
subgraph experiments["experiments/"]
direction TB
otel --- openllmetry --- openllmetry_manual --- bifrost --- portkey --- more_exp[...]
end
subgraph gateways["AI Gateways"]
direction TB
none_gw[none] --- bifrost_gw[bifrost] --- portkey_gw[portkey] --- more_gw[...]
end
subgraph sinks["Sinks"]
direction TB
subgraph grafana_stack["Grafana stack"]
grafana[Grafana] --- prometheus[Prometheus] --- loki[Loki] --- tempo[Tempo]
end
grafana_stack --- signoz[SigNoz] --- more_sink[...]
end
experiments --> gateways
experiments -->|OTLP| collector[OTel Collector Gateway]
gateways -->|OTLP| collector
collector --> sinks
linkStyle 1,2,3,4,5,6,7,8,9,10,11,12,13 stroke:none
Each box is an independent silo. You can add a new instrumentation library, a new gateway, or a new sink without touching the others.
Recommended reading order:
| Order | Experiment | What it demonstrates | README |
|---|---|---|---|
| — | base/ | Uninstrumented RAG app (source of truth) | README |
| 1 | experiments/otel | Vanilla OTel: manual spans, metrics, logs | README |
| 2 | experiments/openllmetry | OpenLLMetry auto-instruments OpenAI SDK (tokens, model, prompts for free) | README |
| 3 | experiments/openllmetry_manual | OpenLLMetry + manual spans (retrieval quality, per-user attribution) | README |
| 4 | experiments/bifrost | Bifrost AI gateway captures provider/model/token telemetry outside the app | README |
Shared infra (pgvector, OTel collector gateway, sinks) lives in infra/.
Quick links:
| Topic | Link |
|---|---|
| Central .env config | infra usage |
| Grafana/Loki/Tempo/Prometheus stack | Grafana stack |
| Bifrost AI gateway | Bifrost gateway |
| Generate Bifrost virtual key | Virtual key instructions |
| Bifrost-specific notes | infra/bifrost README |
Personas and what they care about:
| Persona | What they care about |
|---|---|
| Platform/SRE | Is the service up? Is it slow? |
| FinOps | How much are we spending on LLMs? Per user? Per model? |
| ML/AI Engineer | Is the RAG pipeline working correctly? Are retrievals relevant? |
| Product Manager | How long do users wait for answers? |
| Security/Compliance | What data is being sent to LLMs? |
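The FinOps questions in the table above (spend per user, spend per model) reduce to simple arithmetic once token counts are captured per call. The following is a minimal sketch, not code from this repository: the `LLMCall` record, the `spend_by` helper, the model names, and the prices are all hypothetical placeholders, and real prices vary by provider.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical USD prices per 1,000 tokens. These are placeholders for
# illustration, not real provider prices.
PRICE_PER_1K = {
    "model-a": {"prompt": 0.50, "completion": 1.50},
    "model-b": {"prompt": 0.10, "completion": 0.40},
}


@dataclass(frozen=True)
class LLMCall:
    """One LLM request as it might appear in captured telemetry (assumed shape)."""
    user_id: str
    model: str
    prompt_tokens: int
    completion_tokens: int


def call_cost(call: LLMCall) -> float:
    """Cost of a single call in USD, using the placeholder price table."""
    price = PRICE_PER_1K[call.model]
    return (
        call.prompt_tokens * price["prompt"]
        + call.completion_tokens * price["completion"]
    ) / 1000


def spend_by(calls: list[LLMCall], key: str) -> dict[str, float]:
    """Sum call costs grouped by an LLMCall field such as 'user_id' or 'model'."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[getattr(call, key)] += call_cost(call)
    return dict(totals)


calls = [
    LLMCall("alice", "model-a", prompt_tokens=1000, completion_tokens=200),
    LLMCall("alice", "model-b", prompt_tokens=2000, completion_tokens=500),
    LLMCall("bob", "model-a", prompt_tokens=500, completion_tokens=100),
]

per_user = spend_by(calls, "user_id")
per_model = spend_by(calls, "model")
```

Grouping by `user_id` only works if the instrumentation attaches a user identifier to each call, which is why per-user attribution shows up as a separate concern in the experiments.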
What is observable at each layer:
| Layer | What's observable |
|---|---|
| HTTP/API | Request latency, status codes, route-level metrics |
| RAG/Vector DB | Embedding calls, pgvector query latency, retrieval similarity scores |
| LLM | Token usage, model, prompt/completion content, generation latency |
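To make the three layers concrete, here is a minimal sketch of how one request could yield one timed record per layer. It uses only the standard library as a stand-in for a real tracer. `SpanRecorder`, `handle_query`, the span names, the attribute keys, and the sample values are all hypothetical; they are not the OpenTelemetry API or its semantic conventions.

```python
import time
from contextlib import contextmanager


class SpanRecorder:
    """Toy in-memory tracer: records name, parent, attributes, and duration."""

    def __init__(self):
        self.spans = []   # finished spans, in completion order
        self._stack = []  # names of currently open spans

    @contextmanager
    def span(self, name, **attributes):
        record = {
            "name": name,
            "parent": self._stack[-1] if self._stack else None,
            "attributes": dict(attributes),
        }
        self._stack.append(name)
        start = time.perf_counter()
        try:
            # Yield the attribute dict so the caller can add values mid-span.
            yield record["attributes"]
        finally:
            record["duration_s"] = time.perf_counter() - start
            self._stack.pop()
            self.spans.append(record)


def handle_query(recorder, question):
    """Simulated request touching the HTTP, RAG, and LLM layers."""
    with recorder.span("http.request", route="/query") as http:
        with recorder.span("rag.retrieve", top_k=3) as retrieval:
            # Placeholder scores; a real app would read these from pgvector.
            retrieval["similarity_scores"] = [0.91, 0.84, 0.77]
        with recorder.span("llm.generate", model="example-model") as llm:
            # Placeholder token counts; a real app would use the provider's usage data.
            llm["prompt_tokens"] = len(question.split())
            llm["completion_tokens"] = 5
        http["status_code"] = 200
    return recorder.spans


recorder = SpanRecorder()
spans = handle_query(recorder, "what is observability")
```

Each record carries the fields the table lists for its layer: route and status code for HTTP, similarity scores for retrieval, model and token counts for the LLM call. Child spans finish before their parent, so the HTTP span is recorded last.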