[RFD]: Replacing SMD — inventory-service, TypeID, and the Path Forward #134

Description

@alexlovelltroy

Decision Goal

Should OpenCHAMI adopt openchami/inventory-service as the canonical replacement for SMD, retargeted to the data model proposed in #112 with TypeID as the primary identifier format, and bridged to existing SMD consumers through a time-bounded compatibility shim? This decision unblocks a cluster of dependent work — TPM enrollment (#40), geolocated IP assignment for hot-swap (#29), the node lifecycle state machine (#91), and the hardware inventory data model (#112) — all of which are blocked or constrained by the identity question.

Category

Architecture

Stakeholders / Affected Areas

Every site operating OpenCHAMI and every service that currently talks to SMD: bss, boot-service, ochami, cloud-init, metadata-service, power-control, coresmd, magellan, ansible-smd-inventory, and all downstream community deployments. The TSC owns the outcome. HPE and Dell (as vendor members) have production deployment interest. This decision also directly affects openchami/fabrica — the code generation framework — and every service built on it (fru-tracker, boot-service, and future fabrica-generated services).

Decision Needed By

Before further investment lands in inventory-service, before #112 is accepted or rejected, and before #40 can complete its identity model. Realistically the next 1–2 TSC meetings.

Problem Statement

A. The xname conflation problem

SMD inherited from Cray's CSM a single string — the xname (e.g.,
x1000c1s7b0n0) — that simultaneously encodes two distinct concepts:

  1. Physical location — which cabinet, chassis, slot, BMC port, and node
    index a component currently occupies.
  2. Hardware identity — which physical device is being referred to.

These are not the same thing, and treating them as the same causes compounding
problems throughout the stack:

  • RFD #40 (TPM Enrollment, in progress) states this directly: "location,
    while unique across the system, isn't stable. It is possible and even
    somewhat common to remove a blade from one chassis and replace it in
    another."
    The TPM work requires a stable, hardware-bound identity — IDevID /
    LAK certificates — that travels with the device, not the slot.
  • RFD #29 (Geolocated IPs for hot-swap) identifies that when hardware moves
    to a new xname, its BMC MAC address changes, breaking the expectation that
    management IPs are location-stable. The proposed workaround — rewriting MAC
    addresses — is hardware-dependent. The root cause is the xname simultaneously
    driving DHCP identity and hardware identity.
  • RFD #91 (Node Lifecycle State Machine) asks whether node state should live in the inventory
    service. The answer is ambiguous as long as "the node" means "the location"
    rather than "the device." Moving a blade silently resets its lifecycle history
    under the current model.
  • Cray vocabulary burden: The xname encoding (x, c, s, b, n for
    cabinet, chassis, slot, BMC, node) is Cray-specific and meaningless to
    operators using non-Cray hardware or non-Cray rack naming conventions.

B. inventory-service inherits this debt by construction

openchami/inventory-service
is a fabrica-generated, API-compatible rewrite of SMD. It has valuable
properties — BMC-discovery has been pruned, tokensmith handles AuthN/AuthZ, and
it runs as a lightweight Go binary. But because it is API-compatible, it
inherits SMD's schema, including the xname as primary key. Continuing
inventory-service as-is is implicitly a decision to continue the
xname-as-identity model. That should be a conscious choice, not a default.

C. RFD #112 already proposed a clean data model

RFD #112 (Hardware Inventory API, proposed Sept 2025) independently arrived at
a different model:

  • Device.id is a UUID — stable, system-assigned, not location-encoded.
  • Physical location is a separate concept; the xname, if needed at all, is
    a properties entry that magellan populates when it knows it.
  • parentID expresses hierarchy (GPU belongs to Node, Node belongs to Chassis)
    without Cray vocabulary.
  • An arbitrary properties map handles vendor-specific fields without schema
    changes.

This model directly resolves the instability problems in #40 and #29. It is
also the foundation of #112's proposed Inventory / History / Collection API
trio. However, #112 has no implementation and its relationship to
inventory-service has not been decided. This RFD decides it.

D. Plain UUIDs are stable but opaque

A bare UUID (d5e6f7a8-b9c0-41e2-83f4-a5b6c7d8e9f0) carries no information
about what kind of thing it identifies. An operator grepping a log file or
debugging a boot failure has no way to distinguish a node ID from a BMC ID
without a separate lookup. This replaces Cray vocabulary friction with total
opacity.

It also creates a practical bug class: nothing in the type system prevents
passing a BMC identifier to a function expecting a node identifier. SMD's xnames
at least made the shape visible — x1000c1s7b0 and x1000c1s7b0n0 are
obviously different kinds of thing. A UUID-only model loses that signal entirely.

The xname's readability was a feature. The proposed solution preserves it.

E. fabrica already has a prefixed ID scheme — but with insufficient entropy

openchami/fabrica already generates IDs in the form prefix-<8-random-hex-chars>
(e.g., device-1a2b3c4d). fru-tracker registers "Device" → "device" and
uses GenerateUIDForResource("Device") to produce these IDs at record creation.

This scheme has the right instinct — type prefix plus opaque suffix — but two
problems prevent using it as-is for the inventory plane:

  1. 32 bits of entropy (about 4.3 billion values per prefix) is insufficient for
    global uniqueness across federated sites. By the birthday bound, collision
    probability reaches 50% at roughly 77,000 records per prefix.
  2. ParseUID enforces prefix-hex format using strings.Split(uid, "-"),
    so it would reject TypeID's underscore-separated, base32-encoded format.

TypeID is the natural evolution of fabrica's existing instinct, with a published
spec, a Go library, and a UUID v7 suffix that is both globally unique and
time-sortable.
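
The entropy argument above can be checked with a few lines of Go. This is a minimal sketch of the standard birthday approximation, assuming uniformly random suffixes; recordsForCollision is a hypothetical helper written for this RFD, not a fabrica function.

```go
package main

import "math"

// recordsForCollision estimates how many uniformly random IDs can be drawn
// from a space of 2^bits values before the chance of at least one collision
// reaches p. It uses the standard birthday approximation
// n ≈ sqrt(2 * 2^bits * ln(1/(1-p))).
func recordsForCollision(bits, p float64) float64 {
	space := math.Pow(2, bits)
	return math.Sqrt(2 * space * math.Log(1/(1-p)))
}
```

For the current 32-bit suffix, recordsForCollision(32, 0.5) comes out at roughly 77,000 records; the square-root rule of thumb (2^16 ≈ 65,536) lands in the same range. A UUID v7 leaves 74 bits outside the timestamp and version/variant fields, so if an implementation fills all of them randomly the same calculation gives on the order of 10^11 records, and that bound applies only among IDs minted in the same millisecond.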

F. The existing SMD bugs that motivated this work

Real and unresolved, but secondary to the identity question:

Any replacement path addresses these. The identity question cannot be fixed by
patching SMD.

Proposed Solution

Retarget inventory-service to implement the #112 Device model with TypeID
as the primary identifier, and bridge existing SMD consumers through a
time-bounded compatibility shim.

1. What TypeID is

TypeID is a published open specification
for type-safe, time-sortable, globally unique identifiers, inspired by Stripe's
cus_, ch_ pattern. A TypeID has two parts separated by an underscore:

node_01926e3ca4f17xyz8ab9cd0ef1
└──┘ └──────────────────────────┘
type   UUID v7 in Crockford base32
prefix     (26 chars, URL-safe)
  • The type prefix is a lowercase snake_case string. An operator reading
    any log line, API response, or database record immediately knows the kind of
    resource.
  • The suffix is a UUID v7 encoded in Crockford base32: globally unique,
    time-sortable at millisecond resolution, 26 characters, URL-safe without quoting.
  • The full identifier is a single string with no hyphens: node_01926e3ca4f17xyz8ab9cd0ef1.

TypeID is supported by a Go library (github.com/jetify-com/typeid-go) and
implementations in most major languages. UUID v7 is an IETF standard (RFC 9562,
2024); Postgres stores it natively in the uuid column type, and the
time-ordered insertion pattern avoids most of the B-tree index fragmentation
caused by random UUID v4 keys.
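
To make the format concrete, here is a standard-library-only Go sketch of how a TypeID string is assembled from a UUID v7. It is an illustration of the encoding, not the typeid-go implementation; the function names (uuidV7At, encodeSuffix, newTypeIDAt, newTypeID) are hypothetical, and a real service would presumably use the published library instead.

```go
package main

import (
	"crypto/rand"
	"math/big"
	"time"
)

// alphabet is the lowercase Crockford base32 alphabet (no i, l, o, u).
const alphabet = "0123456789abcdefghjkmnpqrstvwxyz"

// uuidV7At builds a UUID v7 for the given Unix time in milliseconds:
// 48-bit timestamp, version nibble 7, RFC 9562 variant, random remainder.
func uuidV7At(unixMilli int64) ([16]byte, error) {
	var u [16]byte
	if _, err := rand.Read(u[:]); err != nil {
		return u, err
	}
	ms := uint64(unixMilli)
	for i := 0; i < 6; i++ {
		u[i] = byte(ms >> uint(40-8*i)) // big-endian timestamp bytes
	}
	u[6] = (u[6] & 0x0f) | 0x70 // version 7
	u[8] = (u[8] & 0x3f) | 0x80 // variant 10xx
	return u, nil
}

// encodeSuffix renders 128 bits as 26 base32 characters, most significant
// group first; the first character carries only the top 3 bits.
func encodeSuffix(u [16]byte) string {
	n := new(big.Int).SetBytes(u[:])
	mask := big.NewInt(31)
	digit := new(big.Int)
	out := make([]byte, 26)
	for i := 25; i >= 0; i-- {
		out[i] = alphabet[digit.And(n, mask).Int64()]
		n.Rsh(n, 5)
	}
	return string(out)
}

// newTypeIDAt joins a type prefix and an encoded UUID v7 with an underscore.
func newTypeIDAt(prefix string, unixMilli int64) (string, error) {
	u, err := uuidV7At(unixMilli)
	if err != nil {
		return "", err
	}
	return prefix + "_" + encodeSuffix(u), nil
}

// newTypeID mints an ID stamped with the current time.
func newTypeID(prefix string) (string, error) {
	return newTypeIDAt(prefix, time.Now().UnixMilli())
}
```

Because the timestamp occupies the most significant bits and the alphabet is in ASCII order, IDs minted in different milliseconds compare correctly as plain strings.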

2. The prefix vocabulary

The TSC should own and publish a canonical OpenCHAMI TypeID prefix registry.
Proposed initial prefixes:

| Prefix | Resource |
| --- | --- |
| node | Compute node |
| bmc | Baseboard management controller |
| chassis | Chassis or blade enclosure |
| rack | Physical rack |
| switch | Network switch |
| gpu | GPU device (child of node) |
| nic | Network interface card |
| dimm | Memory module |
| pdu | Power distribution unit |

This list is not exhaustive and grows by TSC approval. Just as the xname grammar
was governed by Cray, the TypeID prefix registry is governed by the OpenCHAMI
TSC. It is a governance artifact, not a code artifact — a short Markdown
document in the .github or community repo.
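
Although the registry itself is a governance document, a service still has to enforce it. The following Go sketch shows one way that check might look, assuming the initial prefix list above is compiled into a map; registry, prefixRE, and checkPrefix are hypothetical names. The regular expression reflects the TypeID prefix grammar as I understand it (lowercase letters and underscores, at most 63 characters, starting and ending with a letter).

```go
package main

import (
	"fmt"
	"regexp"
)

// prefixRE encodes the assumed TypeID prefix grammar: 1 to 63 characters
// from [a-z_], beginning and ending with a letter.
var prefixRE = regexp.MustCompile(`^[a-z]([a-z_]{0,61}[a-z])?$`)

// registry mirrors the proposed initial prefix list; in practice it would
// be generated from, or checked against, the TSC-published document.
var registry = map[string]string{
	"node":    "Compute node",
	"bmc":     "Baseboard management controller",
	"chassis": "Chassis or blade enclosure",
	"rack":    "Physical rack",
	"switch":  "Network switch",
	"gpu":     "GPU device (child of node)",
	"nic":     "Network interface card",
	"dimm":    "Memory module",
	"pdu":     "Power distribution unit",
}

// checkPrefix rejects prefixes that are malformed or not yet registered.
func checkPrefix(p string) error {
	if !prefixRE.MatchString(p) {
		return fmt.Errorf("prefix %q is not a valid TypeID prefix", p)
	}
	if _, ok := registry[p]; !ok {
		return fmt.Errorf("prefix %q is not in the OpenCHAMI registry", p)
	}
	return nil
}
```

Keeping the syntax check separate from the membership check means a new prefix approved by the TSC only requires a registry entry, not a code change to the validator.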

3. What the new inventory record looks like

{
  "id": "node_01926e3ca4f17xyz8ab9cd0ef1",
  "deviceType": "Node",
  "manufacturer": "HPE",
  "partNumber": "P38472-B21",
  "serialNumber": "CZ123456789",
  "parentID": "chassis_01926e3ca4f07abc1de2fg3hjk",
  "properties": {
    "xname": "x1000c1s7b0n0",
    "nid": 42,
    "tpm.idevid_cert": "..."
  }
}

node_... is a node. chassis_... is its parent. The parentID is
type-legible without a lookup. When a blade moves from slot 7 to slot 3,
magellan updates properties.xname. FRU history, lifecycle state, attestation
certificates, and error logs all remain attached to the same node_... TypeID.
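
A rough Go sketch of this record and of the blade-move case follows. The struct mirrors the JSON fields shown above; the Device type layout and the decodeDevice, relocate, and xnameOf helpers are assumptions for illustration, not the #112 or inventory-service definitions.

```go
package main

import "encoding/json"

// Device mirrors the example record; Properties stays schemaless.
type Device struct {
	ID           string                     `json:"id"`
	DeviceType   string                     `json:"deviceType"`
	Manufacturer string                     `json:"manufacturer"`
	PartNumber   string                     `json:"partNumber"`
	SerialNumber string                     `json:"serialNumber"`
	ParentID     string                     `json:"parentID"`
	Properties   map[string]json.RawMessage `json:"properties"`
}

// decodeDevice parses a JSON record into a Device.
func decodeDevice(data string) (Device, error) {
	var d Device
	err := json.Unmarshal([]byte(data), &d)
	return d, err
}

// relocate records a new location. Only properties.xname changes; the ID,
// and everything keyed on it, is left alone.
func relocate(d *Device, xname string) error {
	raw, err := json.Marshal(xname)
	if err != nil {
		return err
	}
	if d.Properties == nil {
		d.Properties = map[string]json.RawMessage{}
	}
	d.Properties["xname"] = raw
	return nil
}

// xnameOf returns the current location, if one has been recorded.
func xnameOf(d Device) (string, bool) {
	raw, ok := d.Properties["xname"]
	if !ok {
		return "", false
	}
	var s string
	if err := json.Unmarshal(raw, &s); err != nil {
		return "", false
	}
	return s, true
}
```

Moving the blade from slot 7 to slot 3 is then relocate(&d, "x1000c1s3b0n0"), after which d.ID is still the original node_... value.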

4. Changes to inventory-service

inventory-service is the right vehicle for this implementation; the concern
is the schema, not the code. Concretely:

  1. Adopt TypeID as the primary key. The fabrica code-generation model
    derives schema from Go structs. The id field changes from an xname string
    to a TypeID wrapper type. The UUID v7 suffix is stored in a native uuid
    Postgres column for index efficiency; the prefix is validated at the service
    layer. This is a struct change, not a live-database migration.
  2. Implement the Device model from #112: properties as
    map[string]json.RawMessage, parentID as a TypeID reference, deviceType
    as an extensible string rather than a hard-coded Cray enum.
  3. Magellan writes xname into properties. Magellan populates
    properties.xname as part of its inventory push. Location is discoverable
    data, not foundational identity.
  4. Query by property. GET /devices?properties.xname=x1000c1s7b0n0 lets
    existing callers resolve a TypeID from an xname without knowing it up front.
    The service returns the TypeID; callers use it for all subsequent operations.
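
Item 1 implies that the service layer splits an incoming TypeID into its prefix (validated in code) and a plain UUID (stored in the uuid column). A minimal standard-library sketch of that split, assuming the base32 layout described earlier in this RFD; splitTypeID is a hypothetical helper, and the typeid-go library presumably offers an equivalent.

```go
package main

import (
	"errors"
	"fmt"
	"math/big"
	"strings"
)

// alphabet is the lowercase Crockford base32 alphabet (no i, l, o, u).
const alphabet = "0123456789abcdefghjkmnpqrstvwxyz"

// splitTypeID returns the type prefix and the canonical hyphenated UUID
// string encoded in the 26-character suffix.
func splitTypeID(id string) (string, string, error) {
	i := strings.LastIndex(id, "_")
	if i <= 0 {
		return "", "", errors.New("typeid: missing type prefix")
	}
	prefix, suffix := id[:i], id[i+1:]
	// 26 characters carry 130 bits, so the first one must not exceed 7.
	if len(suffix) != 26 || suffix[0] > '7' {
		return "", "", errors.New("typeid: suffix must be 26 chars and fit in 128 bits")
	}
	n := new(big.Int)
	for j := 0; j < len(suffix); j++ {
		v := strings.IndexByte(alphabet, suffix[j])
		if v < 0 {
			return "", "", fmt.Errorf("typeid: invalid character %q", suffix[j])
		}
		n.Lsh(n, 5)
		n.Or(n, big.NewInt(int64(v)))
	}
	var b [16]byte
	n.FillBytes(b[:])
	uuid := fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
	return prefix, uuid, nil
}
```

The returned UUID string is in the form Postgres accepts for a uuid column, so the index sees a 16-byte value while API callers only ever see the prefixed form.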

5. The SMD compatibility shim

Services that currently talk to SMD — bss, cloud-init, coresmd,
power-control, ansible-smd-inventory — cannot all migrate simultaneously.
An explicit compatibility shim provides runway.

Design: The shim is a thin reverse proxy that translates between the
SMD-compatible wire format (xname-primary, Cray vocabulary) and the
inventory-service API (TypeID-primary, #112 model). It is deployed alongside
inventory-service, not inside it.

Legacy consumer  →  SMD shim  →  inventory-service (TypeID-primary)
    (xname)           │                    │
                      │  GET /devices?     │
                      │  properties.xname= │
                      └────────────────────┘

Specifically, the shim:

  • Accepts requests on the /hsm/v2/... path prefix (SMD's existing API shape).
  • Translates xname-based queries to properties.xname lookups against
    inventory-service.
  • Translates TypeID responses back to xname-primary responses for the legacy
    caller.
  • Is stateless and carries no inventory data of its own.
  • Ships with a committed sunset date — not open-ended compatibility.
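
The request-side translation in the bullets above can be sketched as a pure function. This assumes the legacy path /hsm/v2/State/Components/{xname} and the GET /devices?properties.xname=... filter proposed in this RFD; translateComponentPath is a hypothetical name, and the response-side rewrite is not shown.

```go
package main

import (
	"errors"
	"net/url"
	"strings"
)

// smdComponents is the legacy path prefix this sketch handles.
const smdComponents = "/hsm/v2/State/Components/"

// translateComponentPath maps a legacy single-component lookup to the
// equivalent property-filter query against inventory-service.
func translateComponentPath(path string) (string, error) {
	xname := strings.TrimPrefix(path, smdComponents)
	// Reject paths outside the prefix, empty xnames, and nested paths.
	if xname == path || xname == "" || strings.Contains(xname, "/") {
		return "", errors.New("shim: unsupported legacy path")
	}
	q := url.Values{}
	q.Set("properties.xname", xname)
	return "/devices?" + q.Encode(), nil
}
```

Because the function holds no state and touches no inventory data, it fits the stateless reverse-proxy design: the shim can be scaled or removed without any data migration.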

Scope: The shim covers the read-heavy paths that existing services actually
use: State/Components, Inventory/Hardware, hsm/v2/State/Components group
queries. It does not need to replicate SMD's discovery endpoints, which are
magellan's responsibility.

Sunset timeline: The shim should be deprecated on a published date,
suggested 12–18 months after inventory-service reaches feature parity with
SMD's inventory capabilities. Sites that have not migrated by that date are
pinned to the previous SMD release.

6. Changes to fabrica

TypeID adoption in inventory-service surfaces a natural next step: update
fabrica so that all fabrica-generated services benefit automatically.

What changes in fabrica:

  • GenerateUID / GenerateUIDForResource in pkg/resource/resource.go:
    replace prefix-<8hex> generation with TypeID generation
    (prefix_<uuidv7-base32>). Add github.com/jetify-com/typeid-go as a
    dependency.
  • ParseUID: currently assumes strings.Split(uid, "-") with a hex suffix.
    Replace with TypeID parsing, or retain as a format-aware utility that accepts
    both formats during a transition window.
  • IsValidUID and GetResourceTypeFromUID: update to match.
  • routes.go.tmpl: the line resource.RegisterResourcePrefix("{{.Name}}", "{{toLower .Name}}") continues to work, because toLower produces valid TypeID prefixes. For example, DiscoverySnapshot becomes discoverysnapshot, which is a valid TypeID prefix.

Breaking change assessment: The UID field is typed string throughout the
fabrica storage layer (StorageBackend, ent schema field.String("uid")).
There are no uuid.UUID typed fields to migrate. ParseUID is not called in
any generated handler or storage template — it is a diagnostic utility. The file
backend uses the UID only as a filename component, guarding only against path
traversal (. and /). New-format TypeIDs are valid filenames.

The breaking change is data-level, not code-level: existing records stored
with device-1a2b3c4d format UIDs coexist with new device_01926e... format
records in the same database, because the UID column is a plain string. A
one-time migration utility should be provided to rewrite existing records to
TypeID format for services that want consistency. fru-tracker is at v0.1.0
with no production deployments, so the timing is favorable.
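
For the transition window, a format-aware parser could accept both UID shapes side by side. The sketch below is one possible approach, not fabrica's ParseUID; parseUIDCompat and the two regular expressions are hypothetical, and the legacy pattern assumes lowercase alphanumeric prefixes followed by eight lowercase hex characters.

```go
package main

import (
	"errors"
	"regexp"
)

type uidFormat int

const (
	formatLegacy uidFormat = iota // prefix-<8 hex>
	formatTypeID                  // prefix_<26 base32>
)

var (
	// Assumed legacy shape, e.g. device-1a2b3c4d.
	legacyRE = regexp.MustCompile(`^([a-z0-9]+)-([0-9a-f]{8})$`)
	// TypeID shape: prefix, underscore, 26 Crockford base32 characters
	// with a first character no greater than 7.
	typeIDRE = regexp.MustCompile(`^([a-z](?:[a-z_]{0,61}[a-z])?)_([0-7][0-9a-hjkmnp-tv-z]{25})$`)
)

// parseUIDCompat splits a UID into prefix and suffix and reports which of
// the two formats it matched.
func parseUIDCompat(uid string) (prefix, suffix string, format uidFormat, err error) {
	if m := typeIDRE.FindStringSubmatch(uid); m != nil {
		return m[1], m[2], formatTypeID, nil
	}
	if m := legacyRE.FindStringSubmatch(uid); m != nil {
		return m[1], m[2], formatLegacy, nil
	}
	return "", "", 0, errors.New("uid: neither TypeID nor legacy prefix-hex format")
}
```

A migration utility could use the reported format to select only legacy records for rewriting and leave TypeID records untouched.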

7. Composability constraint (unchanged)

OpenCHAMI components must not hard-depend on heavy external systems to function.
inventory-service must be deployable standalone. Sites that run DCIM platforms
(Nautobot, NetBox, or others) can build integration adapters against the
inventory-service API; those adapters are optional components outside
inventory-service's deployment footprint.

8. Spec publication (end state, not precondition)

Once inventory-service has converged through real use, extract the OpenAPI
spec from the implementation and publish it with a conformance test suite.
The spec is a deliverable, not a constraint. The TypeID prefix registry should
be published alongside it as a companion governance document.

Alternatives Considered

Option A: Continue inventory-service as xname-compatible SMD rewrite

Keep the xname as primary key, fix the discrete bugs, and accept the identity
model as-is. Lowest implementation risk. Preserves full backward compatibility
with all existing consumers.

Why not recommended: The instability problems in #40 and #29 remain
structural. The lifecycle state machine (#91) cannot cleanly answer "what
happens to node state when hardware moves?" The TPM work (#40) must either bolt
TPM IDs on as secondary keys alongside xnames (increasing complexity) or create
a separate identity service (introducing consistency problems). Every downstream
issue that stems from location-as-identity gets harder to address the longer the
xname is the primary key.

Option B: Plain UUID without a type prefix

Use bare UUID as in #112's original proposal. Globally unique, location-stable,
universally supported by databases and language runtimes.

Why not recommended as primary format: Opaque to operators — no signal about
resource type in the identifier itself. Creates a latent bug class where a BMC
UUID can be passed to a function expecting a node UUID without any type-system
signal. TypeID adds readability and Go type safety at negligible cost over bare
UUID v7. If the TSC judges TypeID's non-IETF-standard status unacceptable, UUID
v7 (RFC 9562) is the preferred fallback. UUID v4 should not be chosen because
its random insertion order causes B-tree index fragmentation at HPC scale.
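
The type-safety point can be illustrated with distinct Go types per prefix. This is a sketch of the pattern only; NodeID, BMCID, the parse functions, and powerCycle are hypothetical names, and a real implementation would validate the full TypeID rather than just the prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeID and BMCID are distinct named types, so a value of one cannot be
// passed where the other is expected without an explicit conversion.
type NodeID string
type BMCID string

// parseNodeID accepts only identifiers carrying the node_ prefix.
func parseNodeID(s string) (NodeID, error) {
	if !strings.HasPrefix(s, "node_") {
		return "", fmt.Errorf("not a node ID: %q", s)
	}
	return NodeID(s), nil
}

// parseBMCID accepts only identifiers carrying the bmc_ prefix.
func parseBMCID(s string) (BMCID, error) {
	if !strings.HasPrefix(s, "bmc_") {
		return "", fmt.Errorf("not a BMC ID: %q", s)
	}
	return BMCID(s), nil
}

// powerCycle takes a BMCID; calling it with a NodeID variable does not
// compile.
func powerCycle(id BMCID) string {
	return "power-cycling via " + string(id)
}
```

Untyped string literals still convert implicitly in Go, so the parse functions remain the real boundary check; the named types catch mix-ups between values that have already been parsed. A bare UUID offers neither signal.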

Option C: Build from scratch rather than evolve inventory-service

Implement the #112 model in a new service rather than evolving
inventory-service.

Why not recommended: inventory-service already has meaningful work done —
fabrica alignment, tokensmith integration, BMC-plane separation, initial Go
structure. The concern is the schema, not the code. The schema change (struct
update + TypeID dependency) is far less work than building a new service.

Option D: Permanent dual-API (TypeID internal, xname external forever)

Maintain both a TypeID-primary internal model and a permanent xname-primary API
surface, supporting both indefinitely rather than time-bounding the shim.

Why not recommended: Permanent dual-API creates maintenance burden and
removes the incentive for consumers to migrate. The shim should have a published
sunset date. Open-ended compatibility is a slow path to SMD's current situation.

Option E: Incremental SMD (do nothing)

Patch bugs in SMD, defer replacement. Lowest risk, but architectural debt
compounds. The xname instability is not addressable through bug patches.

Comparison of approaches

✓✓ = strong · ✓ = adequate · ○ = workable but lossy · ✗ = poor

| Aspect | TypeID in inventory-service (recommended) | Plain UUID v7 | xname-compat rewrite | Incremental SMD |
| --- | --- | --- | --- | --- |
| Resolves identity/location conflation | ✓✓ | ✓✓ | | |
| Human-readable type at a glance | ✓✓ | | ✓ (Cray only) | ✓ (Cray only) |
| Type safety enforceable in Go | ✓✓ | | | |
| Globally unique across sites | ✓✓ | ✓✓ | ✗ (xname is local) | |
| DB index performance (B-tree ordered) | ✓✓ | ✓✓ | ✓✓ (string) | ✓✓ |
| Enables stable TPM identity (#40) | ✓✓ | ✓✓ | | |
| Enables stable lifecycle state (#91) | ✓✓ | ✓✓ | | |
| Custom hardware types extensible | ✓✓ | ✓✓ | | |
| Backward compat for SMD consumers | ✓ (shim, time-bounded) | ✓ (shim) | ✓✓ | ✓✓ |
| Reuses existing inventory-service work | ✓✓ | ✓✓ | ✓✓ | n/a |
| Consistent with #112 data model | ✓✓ | ✓✓ | | |
| Alignment with existing fabrica direction | ✓✓ (natural evolution) | | | |

Other Considerations

  • Relationship between this RFD, #112, and inventory-service. These three
    things are currently independent. This RFD proposes resolving that: #112's
    Device model is the target schema for inventory-service, and TypeID
    replaces #112's original bare UUID id field. If there are concerns with
    #112's model, raise them in that issue; this RFD assumes the
    location-decoupled, properties-map approach is directionally correct.

  • xname doesn't disappear; it loses primary key status. Sites with
    operational tooling built on xnames — console servers, Slurm node names,
    runbooks — don't need to change overnight. The xname lives in properties,
    is queryable via the property filter API, and magellan continues to populate
    it. The shim keeps legacy services working during the transition.

  • Magellan's role expands slightly. Under the current xname model, magellan
    registers hardware by submitting an xname. Under the TypeID model, magellan
    submits hardware with its manufacturer/serial/part data; inventory-service
    assigns the TypeID; magellan annotates the returned device with
    properties.xname based on its scan context. This is a small but real change
    to the magellan → inventory-service handshake and warrants a follow-on RFD.

  • coresmd is the hardest downstream consumer. It calls SMD's API directly
    for DHCP lease generation keyed on xname. The shim covers this during the
    transition. Long-term, coresmd resolves the xname to a TypeID via
    GET /devices?properties.xname=... and then uses the TypeID for all
    subsequent operations within a session.

  • TPM identity storage (#40). The TPM RFD can store tpm.idevid_cert and
    related fields in properties without a separate identity service. The stable
    node_... TypeID is the anchor. This resolves the design fork in #40 between
    "add to SMD" vs. "create a new service."

  • Lifecycle state machine (#91). State attaches to a TypeID-identified
    device, not a location. Moving hardware updates properties.xname; the
    lifecycle record stays attached to the TypeID and is unaffected.

  • Multi-vendor Redfish (roadmap #123). Vendor-specific Redfish quirks belong
    only in magellan. Magellan's vendor normalization produces a clean Device
    record regardless of vendor; inventory-service never sees vendor-specific
    encoding.

  • Single-maintainer risk. inventory-service has one primary committer.
    Growing active contributors is a parallel objective, independent of this
    decision.

  • Performance. TypeID suffixes are UUID v7 values, which are time-ordered, so
    new rows append near the right edge of the B-tree instead of splitting pages
    at random positions. This should improve on UUID v4 and be comparable to
    string xname columns at HPC scale. Verify against SMD's production workload
    before full promotion.

  • Phasing and feature parity. The first milestone for inventory-service
    under the TypeID model: feature parity with SMD's inventory capabilities (not
    its discovery capabilities, which are magellan's responsibility). New features
    and SMD feature pruning are deferred to subsequent RFDs.

  • Data sovereignty. Once the implementation has converged, the TSC should
    own and publish both the API spec and the TypeID prefix registry independently
    of any single implementation.

Work Items

  1. Align #112 and inventory-service. TSC formally adopts #112's Device
    model as inventory-service's target schema.
  2. TypeID primary key in inventory-service. Add
    github.com/jetify-com/typeid-go. Change id field to TypeID; store UUID
    v7 suffix in a native uuid Postgres column.
  3. properties map. Implement map[string]json.RawMessage including
    xname as a first-class property key that magellan populates.
  4. Query by property. GET /devices?properties.xname=... and similar filter
    expressions for xname-based lookup.
  5. SMD compatibility shim. Build and deploy the translation layer covering
    the read paths bss, cloud-init, coresmd, remote-console and power-control actually
    use. Commit a sunset date.
  6. TSC prefix registry. Publish the canonical prefix list as a governance
    document. Establish the process for adding prefixes.
  7. TypeID in fabrica. Update GenerateUID / GenerateUIDForResource to
    produce TypeIDs. Update ParseUID to accept TypeID format. Ship a migration
    utility for services with existing prefix-<hex> records. Bump fabrica minor
    version.
  8. BSS, cloud-init, coresmd, power-control, remote-console adapters. Each downstream service
    migrates from xname lookup to TypeID-first lookup with properties.xname
    filter for legacy resolution. Coordinate with shim sunset.
  9. Magellan handshake update. Adapt magellan to submit device data without
    an xname primary key; receive a TypeID back; annotate with properties.xname.
    Follow-on RFD for the full interface spec.
  10. TPM identity storage (#40). Store tpm.idevid_cert and related fields
    in properties.
  11. Lifecycle state machine (#91). Implement with state keyed on TypeID, not
    xname.
  12. Idempotent PATCH semantics (smd [RFD] API Tagging #79 equivalent). Implement in
    inventory-service.
  13. Virtual and hybrid device types (smd [FEATURE] query multiple xnames under the same http request #71 equivalent). deviceType field
    plus open prefix registry; virtual nodes use prefix node, distinguished by
    deviceType.
  14. Migration tooling. Shadow mode → consumer cutover → SMD decommission.
    Coordinate shim sunset with magellan rollout progress.
  15. Performance benchmarking. Verify on-par or better performance against
    SMD's production workload before promoting to default.

Related Docs / PRs
