AI Observability

This repository benchmarks AI observability approaches using a minimal RAG application. Every experiment runs the same app with different instrumentation, so you can compare what each approach captures.

Architecture

graph LR
    User --> experiments

    subgraph experiments["experiments/"]
        direction TB
        otel --- openllmetry --- openllmetry_manual --- bifrost --- portkey --- more_exp[...]
    end

    subgraph gateways["AI Gateways"]
        direction TB
        none_gw[none] --- bifrost_gw[bifrost] --- portkey_gw[portkey] --- more_gw[...]
    end

    subgraph sinks["Sinks"]
        direction TB
        subgraph grafana_stack["Grafana stack"]
            grafana[Grafana] --- prometheus[Prometheus] --- loki[Loki] --- tempo[Tempo]
        end
        grafana_stack --- signoz[SigNoz] --- more_sink[...]
    end

    experiments --> gateways
    experiments -->|OTLP| collector[OTel Collector Gateway]
    gateways -->|OTLP| collector
    collector --> sinks

    linkStyle 1,2,3,4,5,6,7,8,9,10,11,12,13 stroke:none

Each box is an independent silo. You can add a new instrumentation library, a new gateway, or a new sink without touching the others.

Experiments

Infrastructure

Shared infra (pgvector, OTel collector gateway, sinks) lives in infra/.

Quick links:

Topic	Link
Central `.env` config	infra usage
Grafana/Loki/Tempo/Prometheus stack	Grafana stack
Bifrost AI gateway	Bifrost gateway
Generate Bifrost virtual key	Virtual key instructions
Bifrost-specific notes	infra/bifrost README

Personas

Persona	What they care about
Platform/SRE	Is the service up? Is it slow?
FinOps	How much are we spending on LLMs? Per user? Per model?
ML/AI Engineer	Is the RAG pipeline working correctly? Are retrievals relevant?
Product Manager	How long do users wait for answers?
Security/Compliance	What data is being sent to LLMs?
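
As an illustration of the FinOps question above (spend per user and per model), the following is a minimal sketch that aggregates cost from token-usage records. It is not taken from the experiments: the record fields, the model names, and the prices are all hypothetical, and real provider pricing will differ.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens (assumption, not real pricing).
PRICE_PER_1K = {
    "model-a": {"prompt": 0.50, "completion": 1.50},
    "model-b": {"prompt": 0.10, "completion": 0.40},
}


def call_cost(record):
    """Return the cost in USD of one LLM call, based on its token counts."""
    price = PRICE_PER_1K[record["model"]]
    return (
        record["prompt_tokens"] * price["prompt"]
        + record["completion_tokens"] * price["completion"]
    ) / 1000


def spend_by(records, key):
    """Sum call costs grouped by a record field such as "user" or "model"."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += call_cost(record)
    return dict(totals)


# Hypothetical usage records, shaped like what token telemetry could provide.
records = [
    {"user": "alice", "model": "model-a", "prompt_tokens": 1000, "completion_tokens": 2000},
    {"user": "bob", "model": "model-b", "prompt_tokens": 4000, "completion_tokens": 1000},
    {"user": "alice", "model": "model-b", "prompt_tokens": 2000, "completion_tokens": 0},
]

per_user = spend_by(records, "user")
per_model = spend_by(records, "model")
```

The same grouping works for any attribute attached to a call, which is why per-user attribution has to be recorded at instrumentation time rather than reconstructed later.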

Observable surfaces

Layer	What's observable
HTTP/API	Request latency, status codes, route-level metrics
RAG/Vector DB	Embedding calls, pgvector query latency, retrieval similarity scores
LLM	Token usage, model, prompt/completion content, generation latency
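
To make the three layers concrete, here is a rough standard-library sketch of one request producing a timed record per layer. It deliberately does not use the OpenTelemetry API; `span`, `SPANS`, `handle_question`, the span names, and the attribute values are all hypothetical stand-ins for what a real instrumentation library would emit.

```python
import time
from contextlib import contextmanager

SPANS = []  # In-memory stand-in for an exporter (assumption for illustration).


@contextmanager
def span(name, **attributes):
    """Time a block of work and record it with its attributes."""
    record = {"name": name, "attributes": dict(attributes)}
    start = time.perf_counter()
    try:
        # Yield the attribute dict so the caller can add values as it learns them.
        yield record["attributes"]
    finally:
        record["duration_s"] = time.perf_counter() - start
        SPANS.append(record)


def handle_question(question):
    """Simulate one RAG request that touches all three observable layers."""
    with span("http.request", route="/ask") as http:
        with span("rag.retrieve") as retrieval:
            # Placeholder for an embedding call plus a pgvector query.
            retrieval["top_similarity"] = 0.87
        with span("llm.generate", model="model-a") as llm:
            # Placeholder for the chat completion call.
            llm["prompt_tokens"] = 120
            llm["completion_tokens"] = 45
        http["status_code"] = 200
    return SPANS
```

Calling `handle_question("...")` appends three records to `SPANS`. The inner records are appended first because each one is written when its block exits, and the HTTP block exits last.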

Order	Experiment	What it demonstrates	README
—	`base/`	Uninstrumented RAG app (source of truth)	README
1	`experiments/otel`	Vanilla OTel: manual spans, metrics, logs	README
2	`experiments/openllmetry`	OpenLLMetry auto-instruments OpenAI SDK (tokens, model, prompts for free)	README
3	`experiments/openllmetry_manual`	OpenLLMetry + manual spans (retrieval quality, per-user attribution)	README
4	`experiments/bifrost`	Bifrost AI gateway captures provider/model/token telemetry outside the app	README
