Decision Goal
Should OpenCHAMI evolve magellan from a discovery CLI into a persistent stateless service that owns all BMC interactions — continuous discovery (replacing SMD's internal loop), FRU-tracker collection, firmware updates with local HTTP image staging, BIOS configuration, BMC settings (NTP/Syslog/SSH keys per magellan #129), power operations, console URI lookup — while shielding vendor-specific BMC inconsistencies from the rest of the infrastructure and coordinating with upstream services' state machines? And in doing so, supersede #123 by broadening its plugin-selection scope into a complete service architecture.
Category
Architecture
Stakeholders / Affected Areas
The magellan maintainers (scope expansion); the SMD / inventory-service track (loses the internal discovery loop, gains a downstream writer); power-control (today calls Redfish directly — would delegate); remote-console (today depends on SMD's parsed manager data, broken by smd #91 — would call magellan); fru-tracker (today event-fed — would consume magellan's discovery output); ex-bootstrap (today does direct Redfish for cabinet discovery and firmware triggers — candidate for consolidation); the #91 lifecycle state machine (magellan becomes a participant); the #128 Vault credential flow (magellan is the natural Vault consumer for BMC creds); BMC vendors (HPE, Dell, Supermicro, Cray, Lenovo — the abstraction layer must accommodate them); cluster operators (firmware update workflows in particular).
Decision Needed By
Coordinate with the SMD-replacement RFD (whichever lands) and #91 state machine. Magellan's discovery scope is the dependency for SMD's internal loop being retired; the state-machine pattern affects magellan's RPC vs async surface. Realistic target: in-discussion in the next 1–2 TSC meetings.
Problem Statement
BMC interactions in OpenCHAMI today are fragmented across 4–5 services, each implementing its own Redfish handling and learning vendor quirks independently. This shows up in concrete ways:
SMD owns the internal Redfish discovery loop and webhook ingestion. The V2 parser bug smd #91 drops Manager ComponentEndpoint records — and the downstream consequence is that remote-console can't read RedfishManagerInfo.CommandShell and console access breaks. One parser bug, multiple services degraded.
power-control calls Redfish directly for power state and reset operations.
remote-console reads parsed manager metadata from SMD — i.e., depends on SMD getting the parsing right.
ex-bootstrap calls Redfish directly for cabinet discovery via /Managers/BMC/EthernetInterfaces and has a firmware-update trigger stub.
fru-tracker is event-fed from external collectors; no direct BMC interaction but depends on whoever produces the DiscoverySnapshot payload getting Redfish right.
magellan today is a CLI tool with scan/crawl/collect/power/update/login/secrets subcommands. It is single-shot with no daemon mode, and vendor-specific handling is thin (generic via gofish, plus jaws DIMM quirks). The update command is a stub with no execution engine, and there is no local HTTP server for firmware image staging.
There is also a strategic gap: capabilities OpenCHAMI clearly needs but doesn't have a home for — firmware update execution with local HTTP staging, BIOS configuration, BMC settings management per magellan #129, continuous discovery to replace SMD's loop. Adding each of these to a different existing service compounds the fragmentation. Adding them all to the existing magellan-as-CLI doesn't work because they need a persistent execution model and state-machine coordination.
#123 Redfish Interface Strategy was opened to address vendor inconsistency via plugin-based Redfish client selection. It's the right direction but the wrong scope: it proposes plugin selection for discovery clients but is silent on (a) the persistent-service question, (b) firmware updates, (c) credential lifecycle, (d) state-machine coordination, and (e) IPMI fallback. The plugin idea is good; this RFD generalizes it into a service architecture.
The architectural opportunity: one service owns BMC interaction; vendor quirks are absorbed there; the rest of the infrastructure consumes a uniform internal API; the security perimeter around BMC credentials becomes small and well-defined; the microservice mesh simplifies (this RFD is the BMC-connected sibling of the SMD-replacement RFD's BMC-disconnected plane).
Proposed Solution
Evolve magellan along three axes simultaneously: execution model (CLI → persistent service), scope (discovery → all BMC interaction), and abstraction (generic Redfish → vendor-shielded interface).
Execution model: persistent stateless service
Daemon: long-running process with a stable API surface (gRPC and/or REST), suitable for Kubernetes Deployment or systemd unit.
Retain CLI subcommands for operator-driven workflows; CLI calls into the daemon's API or runs standalone for the simplest use cases.
Stateless: no database. Sources of truth live in inventory-service (components, FRU, endpoints, state), fru-tracker (FRU device hierarchy), Vault (credentials per #128). Magellan caches as needed but loses no state on restart.
Health, metrics (Prometheus), structured logging, graceful shutdown are baseline.
Scope: all BMC interactions
Per the research's capability inventory, as summarized in the decision goal: continuous discovery (replacing SMD's internal loop), FRU-tracker collection, firmware updates with local HTTP image staging, BIOS configuration, BMC settings (NTP/Syslog/SSH keys per magellan #129), power operations, and console URI lookup.
Out of scope (explicit non-goals): … boot-service.
Abstraction: vendor-shielded interface
The plugin idea from #123 lives on, generalized:
Internal abstraction layer: magellan exposes a uniform internal API (e.g., BMCClient.PowerOn(ctx, endpoint, opts)) regardless of vendor. The vendor-specific dispatch happens inside magellan.
Vendor plugins (internal first): per-vendor handling for HPE, Dell, Supermicro, Cray, Lenovo lives in pkg/bmc/vendor/{vendor}/ initially. Plugin interface stabilizes over time.
Detection: vendor + model + firmware version detected via /redfish/v1 metadata at first contact; cached per-endpoint.
Fallback: generic Redfish client (current gofish-based) handles unknown vendors; quirks fail loudly so users know to file a vendor support PR.
Quirk examples the abstraction needs to absorb (from research):
HPE / Dell differ on firmware UpdateService Actions and HTTP/HTTPS requirements.
Power operations differ on ETag handling and reset action shapes.
BIOS configuration lives in vendor OEM namespaces (Oem.Hpe, Oem.Dell, Oem.Supermicro).
Manager metadata layout (the root cause of smd #91).
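The dispatch described above can be sketched in Go as a small registry that maps a detected vendor to a plugin and falls back to a generic client. This is a minimal illustration under stated assumptions, not magellan's actual API: Endpoint, Registry, stubClient, and the Name method are hypothetical, and the opts argument from the BMCClient.PowerOn example is left out for brevity.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Endpoint is a hypothetical handle for one BMC.
type Endpoint struct {
	Address string
	Vendor  string // assumed to be filled in by detection at first contact
}

// BMCClient is the uniform internal API. Only PowerOn is sketched; Name is
// added here purely so the dispatch result is observable.
type BMCClient interface {
	Name() string
	PowerOn(ctx context.Context, ep Endpoint) error
}

// stubClient stands in for a real vendor implementation.
type stubClient struct{ name string }

func (c stubClient) Name() string { return c.name }

func (c stubClient) PowerOn(ctx context.Context, ep Endpoint) error {
	// A real plugin would issue the vendor's reset action here.
	return ctx.Err()
}

// Registry maps a detected vendor to its plugin, with a generic fallback.
type Registry struct {
	plugins  map[string]BMCClient
	fallback BMCClient
}

func NewRegistry(fallback BMCClient) *Registry {
	return &Registry{plugins: map[string]BMCClient{}, fallback: fallback}
}

// Register stores a plugin under a case-insensitive vendor name.
func (r *Registry) Register(vendor string, c BMCClient) {
	r.plugins[strings.ToLower(vendor)] = c
}

// For returns the vendor plugin if one is registered, else the generic client.
func (r *Registry) For(ep Endpoint) BMCClient {
	if c, ok := r.plugins[strings.ToLower(ep.Vendor)]; ok {
		return c
	}
	return r.fallback
}

func main() {
	reg := NewRegistry(stubClient{name: "generic"})
	reg.Register("HPE", stubClient{name: "hpe"})
	for _, ep := range []Endpoint{{"10.0.0.1", "HPE"}, {"10.0.0.2", "Acme"}} {
		c := reg.For(ep)
		err := c.PowerOn(context.Background(), ep)
		fmt.Println(ep.Address, c.Name(), err)
	}
}
```

Callers only ever see BMCClient, so adding a vendor means registering one more plugin rather than touching power-control or remote-console.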
State-machine coordination
Magellan as a daemon can be a first-class participant in upstream state machines (per #91):
Blocking RPC for synchronous transitions where the caller needs the result before proceeding (e.g., power-control: "power on and wait until BMC reports Powered").
Async with callbacks for long-running operations (firmware update: caller submits work; magellan reports completion via webhook to the state-machine engine).
State-aware validation (optional): magellan can refuse operations the state machine has not authorized (e.g., reject firmware update if node is not in firmware_update state). Whether this lives in magellan or in the state-machine engine is an open question.
Audit trail: every BMC operation logs who requested it, when, what happened. Magellan is the single point where this gets captured.
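The optional state-aware validation could look roughly like the following Go sketch. It assumes the check lives in magellan, which the RFD leaves open; the Authorize function, the operation names, and the "ready" state are hypothetical, and only firmware_update comes from the text above.

```go
package main

import "fmt"

// requiredState maps an operation to the lifecycle state that must be active.
// Operation names are hypothetical; firmware_update is the RFD's own example.
var requiredState = map[string]string{
	"firmware-update": "firmware_update",
}

// Authorize returns an error when the node's current state does not permit op.
// Operations with no entry in requiredState are allowed in any state.
func Authorize(op, nodeState string) error {
	want, gated := requiredState[op]
	if gated && nodeState != want {
		return fmt.Errorf("%s refused: node is in %q, need %q", op, nodeState, want)
	}
	return nil
}

func main() {
	fmt.Println(Authorize("firmware-update", "firmware_update"))
	fmt.Println(Authorize("firmware-update", "ready"))
	fmt.Println(Authorize("power-on", "ready"))
}
```

If the check ends up in the state-machine engine instead, the same table would simply move there and magellan would trust the caller.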
Firmware update — OCI/ORAS artifacts + local HTTP server
The firmware-management piece becomes a thin shim over an OCI registry. Each firmware bundle is published to the registry as an OCI artifact via ORAS — using the existing OCI primitives (manifests, layers/blobs, descriptors, annotations, content-addressed digests) to get immutability, supply-chain attachments, and a vendor-agnostic distribution model "for free."
Firmware bundle artifact taxonomy
A firmware bundle is identified by a content-addressed digest (immutable identity) plus optional tags (mutable convenience labels) plus annotations (searchable description). Example shape:
Manifest:
    {
      "artifactType": "application/vnd.openchami.firmware.bundle.v1+json",
      "annotations": {
        "org.opencontainers.image.title": "cray-ex-node-bmc",
        "org.opencontainers.image.version": "2.14.7",
        "org.opencontainers.image.vendor": "HPE",
        "org.opencontainers.image.created": "2026-05-28T00:00:00Z",
        "org.openchami.firmware.component": "bmc",
        "org.openchami.firmware.platform": "cray-ex",
        "org.openchami.firmware.hw-revision": "rev-a",
        "org.openchami.firmware.compatibility": "vendor-certified"
      }
    }
Layers (per-file media types; exact names TBD but the pattern is the point):
    firmware.bin        application/vnd.openchami.firmware.payload.v1
    firmware.json       application/vnd.openchami.firmware.metadata.v1+json
    release-notes.txt   text/plain
    checksums.txt       text/plain
Attached artifacts via OCI referrers (the extensibility path):
    application/vnd.dev.cosign.simplesigning.v1+json
    application/spdx+json
    application/vnd.in-toto+json
    application/vnd.openchami.firmware.test-report.v1+json
    application/vnd.openchami.firmware.approval.v1+json
Because these are custom artifact types (not container image types), a docker pull against them is refused — no risk of accidental interference between firmware bundles and the org's regular container images in the same registry. The publisher pipeline can attach SBOMs/attestations the firmware service doesn't yet read; a separate CI job can re-check those layers and update labels/tags the service does read. Publisher and consumer are decoupled by the registry.
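Because each file in a bundle carries its own layer media type, a consumer can pick out the firmware payload by media type alone. The Go sketch below parses a manifest with the standard mediaType/digest/size descriptor fields and returns the payload layer. The payload media type is the provisional name this RFD uses for the payload layer; the digests, sizes, and the payloadLayer helper are made up for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Provisional media type for the firmware payload layer (exact name TBD).
const payloadMediaType = "application/vnd.openchami.firmware.payload.v1"

// descriptor holds the descriptor fields this sketch needs.
type descriptor struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
}

type manifest struct {
	ArtifactType string       `json:"artifactType"`
	Layers       []descriptor `json:"layers"`
}

// payloadLayer returns the first layer whose media type marks it as payload.
func payloadLayer(raw []byte) (descriptor, error) {
	var m manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return descriptor{}, err
	}
	for _, l := range m.Layers {
		if l.MediaType == payloadMediaType {
			return l, nil
		}
	}
	return descriptor{}, errors.New("no firmware payload layer in manifest")
}

// sampleManifest uses placeholder digests and sizes.
const sampleManifest = `{
  "artifactType": "application/vnd.openchami.firmware.bundle.v1+json",
  "layers": [
    {"mediaType": "text/plain", "digest": "sha256:aaaa", "size": 120},
    {"mediaType": "application/vnd.openchami.firmware.payload.v1", "digest": "sha256:bbbb", "size": 4096}
  ]
}`

func main() {
	d, err := payloadLayer([]byte(sampleManifest))
	fmt.Println(d.Digest, d.Size, err)
}
```

Only the returned descriptor's blob needs to be fetched, which is what keeps the pull selective.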
Update flow
What magellan actually does, with the OCI/ORAS layer present:
Resolve the bundle: caller supplies a reference (registry URL + tag or digest). Magellan resolves the tag to a digest at lookup time so the rest of the operation is pinned to immutable content.
Policy check via attached artifacts: optionally walk the referrers — verify the cosign signature, check SBOM presence, confirm a vendor-certified test report exists, evaluate site policy. Failures abort the update before any BMC contact.
Selective ORAS pull: fetch only the payload layer (vnd.openchami.firmware.payload.v1), not the full bundle. Bandwidth-friendly when bundles contain release notes, SBOMs, etc.
Stage on local HTTP server: ephemeral configurable-port server; per-update; teardown on completion. Default plain HTTP with a one-shot path token in the URL (most BMCs don't validate TLS certs); configurable to TLS where required.
Trigger the BMC: POST UpdateService.SimpleUpdate (or MultipartHttpPushUpdate per vendor capability detection) with ImageURI pointing at the local server. The BMC pulls; magellan polls UpdateService for progress; reports state to the state-machine engine; handles timeouts and retries.
Tear down: stop the HTTP server, evict the payload from local cache (or retain per cache policy), report completion to the state machine.
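The staging step can be sketched as an HTTP handler that serves the payload only under an unguessable path. This Go example assumes "one-shot" means one random token per update that is discarded at teardown; the /firmware.bin path suffix, the function names, and the use of httptest to exercise the handler in-process are all illustrative choices rather than part of the proposal.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newPathToken returns a random 128-bit token encoded as hex.
func newPathToken() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// stagingHandler serves payload only at /<token>/firmware.bin.
func stagingHandler(token string, payload []byte) http.Handler {
	want := []byte("/" + token + "/firmware.bin")
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Constant-time comparison avoids leaking the token through timing.
		if subtle.ConstantTimeCompare([]byte(r.URL.Path), want) != 1 {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/octet-stream")
		w.Write(payload)
	})
}

// statusFor issues an in-process GET and returns the response status code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	token, err := newPathToken()
	if err != nil {
		panic(err)
	}
	h := stagingHandler(token, []byte("firmware bytes"))
	fmt.Println(statusFor(h, "/"+token+"/firmware.bin"))
	fmt.Println(statusFor(h, "/guess/firmware.bin"))
}
```

In a real service the handler would be mounted on an http.Server bound to the configured port, and the ImageURI handed to the BMC would embed the token.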
Why this is meaningfully different from "magellan with a blob cache"
Immutability and content-addressing: the digest is the identity. No "wait, which build of firmware-v2.14.7 was that?" — the digest answers exactly.
Supply-chain extensibility for free: cosign-signed today, SBOM-attached tomorrow, in-toto provenance the day after — without ever modifying magellan. The artifact grows new attachments; the service grows new readers when it's ready.
Vendor publishing decoupled: HPE/Dell/etc. push to a registry; magellan reads. No bespoke distribution channel.
Same primitives the rest of the cloud-native ecosystem uses: anyone who knows OCI/registry mechanics already understands the firmware library.
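The "digest is the identity" point is easy to make concrete: before staging, the pulled payload can be hashed and compared to the digest the reference was pinned to. The Go sketch below assumes SHA-256 digests in the usual sha256:<hex> form; the function names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// digestOf returns the content-addressed identity of payload.
func digestOf(payload []byte) string {
	sum := sha256.Sum256(payload)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// verifyDigest fails if payload does not hash to the pinned digest.
func verifyDigest(payload []byte, pinned string) error {
	if got := digestOf(payload); got != pinned {
		return fmt.Errorf("digest mismatch: got %s, pinned %s", got, pinned)
	}
	return nil
}

func main() {
	payload := []byte("firmware bytes")
	pinned := digestOf(payload)
	fmt.Println(verifyDigest(payload, pinned))
	fmt.Println(verifyDigest([]byte("tampered"), pinned) != nil)
}
```

A mismatch here would abort the update before any BMC contact, in the same spirit as the policy check in the update flow.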
Alternatives Considered
Continue with #123 (Redfish Interface Strategy) as-is. Narrow but correct on its own terms. Doesn't address the consolidation, firmware-update, or state-machine questions; doesn't make magellan a persistent service. The plugin idea survives in this RFD; the rest of #123's scope is too small.
Keep BMC interactions distributed across services (status quo). Each service learns vendor quirks independently. smd #91-style cascading failures continue. Adding new capabilities (firmware update, BIOS config) creates new homes; mesh gets more complex, not less.
Per-vendor service (one daemon per BMC vendor). Operationally heavy: the operational footprint grows linearly with vendor diversity, and cross-vendor coordination (e.g., in a heterogeneous cluster) becomes a separate problem. A plugin model inside one service is the better factoring.
Library, not service. Magellan as a Go library imported by power-control, remote-console, fru-tracker, etc. Loses the central credential cache, the state-machine integration point, and the audit trail. Each service becomes its own BMC-talking daemon. Less consolidation; more places for vendor quirks to live.
Outsource to a third party (e.g., OpenStack Ironic for BMC interaction). Tightly couples OpenCHAMI to OpenStack's runtime and brings in much more than the BMC layer needs. Co-existence is possible where Ironic is already deployed, but it is not a substitute for an OpenCHAMI-native BMC service.
Other Considerations
Migration sequencing matters. Magellan must reach feature parity with SMD's discovery loop before SMD's loop is turned off — otherwise discovery goes dark. Phased plan: (1) magellan-as-daemon stands up alongside SMD's loop; (2) downstream services migrate one at a time (power-control → magellan, remote-console → magellan, etc.); (3) SMD's loop retires when no one's left consuming it.
Coordination with the SMD-replacement RFD. This RFD is the BMC-connected sibling of the SMD-replacement RFD's BMC-disconnected plane. Both must land in compatible shapes — if SMD-replacement chooses spec-first (Option C in that RFD), the magellan-to-inventory write interface becomes part of that spec.
Event subscriptions: deferred decision. SMD does this today. Whether the replacement architecture needs Redfish webhook ingestion at all — and if so, whether it lives in magellan or in a separate event-ingestion service — is undecided. The default assumption is "yes, it lives in magellan" because magellan is the BMC-connected service; but polling can substitute for many use cases and is more reliable across firewall/reboot/subscriber-crash failures.
Firmware artifact storage is the OCI registry, with ORAS handling the non-container artifact mechanics (custom artifact types, layer media types, manifest annotations, referrer attachments). Magellan caches recently-used payloads on disk per a configurable policy. This is distinct from #129 Artifact Library Service — that RFD is about S3-shaped object storage (operator-staged OS images, kickstart files, etc.); the firmware concern is OCI-registry-shaped (immutable, content-addressed, with referrer attachments for SBOMs/signatures). Both can coexist; they solve different distribution-shape problems. The org doesn't yet have a designated OCI registry — that's a sequencing dependency for this RFD's firmware portion (Harbor, Distribution, GHCR, ECR, and others are all candidates; choice is out of scope here).
Local HTTP server security. BMCs rarely validate TLS certs. Default to plain HTTP with a one-shot path token in the URL; configurable to TLS with imported cert chains for environments that require it.
Plugin maintenance. Same governance question as #123: in-tree vendor plugins (OpenCHAMI maintains) vs. external (vendors contribute). Recommend in-tree first for the common vendors (HPE, Dell, Supermicro, Cray, Lenovo); extensibility for site-specific or rare-vendor plugins via a versioned plugin interface.
Backward compatibility. ex-bootstrap, power-control, and remote-console work today by calling Redfish or reading SMD. The migration to magellan calls is phased: old call paths stay supported during the transition and are then deprecated cleanly.
Failure recovery. Magellan may crash mid-operation. Because magellan doesn't checkpoint, the external state machine needs to handle this, and operations need clear idempotency semantics so retries are safe.
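One way to get retry-safe semantics without checkpointing is to phrase operations as "converge to a desired state" rather than "perform an action". The Go sketch below illustrates that idea for power operations only; it is an assumption about how idempotency might be achieved, not something the RFD specifies, and fakeBMC and EnsurePower are hypothetical names.

```go
package main

import "fmt"

// PowerState is the power state a BMC reports.
type PowerState string

const (
	PowerOn  PowerState = "On"
	PowerOff PowerState = "Off"
)

// fakeBMC stands in for a real BMC and counts reset actions.
type fakeBMC struct {
	state  PowerState
	resets int
}

func (b *fakeBMC) reset(target PowerState) {
	b.state = target
	b.resets++
}

// EnsurePower issues a reset only if the BMC is not already in the desired
// state, so a retry after a crash cannot trigger a second reset. It reports
// whether a reset was issued.
func EnsurePower(b *fakeBMC, desired PowerState) bool {
	if b.state == desired {
		return false
	}
	b.reset(desired)
	return true
}

func main() {
	bmc := &fakeBMC{state: PowerOff}
	fmt.Println(EnsurePower(bmc, PowerOn)) // first attempt issues a reset
	fmt.Println(EnsurePower(bmc, PowerOn)) // retry finds the state already reached
	fmt.Println(bmc.resets)
}
```

Long-running operations such as firmware updates would need a richer check (for example, comparing the installed version to the target), but the same read-then-act shape applies.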
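Related Docs / PRs
openchami/magellan: the repo whose evolution this RFD proposes.
openchami/magellan issue #129: BMC configuration (NTP, Syslog, SSH keys); folded into the proposed scope.
openchami/smd issue #91: the V2 parser bug; a concrete example of vendor-quirk damage when there's no abstraction layer.
openchami/ex-bootstrap: current home of cabinet discovery and the firmware-trigger stub; candidates for consolidation into magellan.
openchami/power-control, openchami/remote-console, openchami/fru-tracker: current direct or indirect BMC-touching services that would migrate to magellan calls.
ORAS: the tool/library that makes OCI registries usable for non-container artifacts (custom artifact types, layer media types, referrer attachments).
OCI Distribution Spec: the registry protocol; in particular the Referrers API, which gives us the extensibility path for SBOMs / signatures / attestations attached to firmware bundles.
Cosign: signs anything in a registry; the natural choice for firmware-bundle signing alongside the existing gpg-signing-manager RPM-signing pattern.
SPDX, in-toto: the SBOM / provenance formats the OCI referrer attachments would carry.
The SMD-replacement RFD draft (in vault at Drafts/RFD - SMD replacement.md): the BMC-disconnected sibling. Together these two RFDs define the architectural split between BMC-connected and BMC-disconnected planes.
Research backing this draft: scratch/bmc-interaction/research.md, with the full capability inventory, vendor-inconsistency examples, state-machine integration patterns, and open-question list.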
Outcome (to be filled in after discussion)
If accepted, #123 outcome:
Decision date:
Recorded by: