Decision Goal
Single System Image Integration
Category
Architecture
Stakeholders / Affected Areas
No response
Decision Needed By
No response
Problem Statement
While image-based provisioning using well-curated images is fast, the static nature of these images presents significant lifecycle management challenges:
- Inflexibility & Downtime: Applying updates or changes to a static image requires building a new image and rebooting nodes to load it, which causes avoidable cluster downtime.
- Configuration Drift: To avoid reboots, administrators often rely on post-boot configuration management tools (like Ansible, Chef, or Puppet) to apply changes to running nodes. Over time, the running nodes diverge from the image they booted from, which leads to state inconsistencies across the cluster.
- DevOps Friction: In our experience, current provisioning strategies lack the flexibility needed to accommodate live, online changes without eventually breaking the reproducibility and integrity of the DevOps pipeline.
Proposed Solution
We propose implementing support for Single System Image (SSI) provisioning, in which compute nodes mount a read-only shared filesystem as their rootfs.
A recent Proof of Concept (PoC) developed by Do IT Now successfully demonstrates deploying clusters using a read-only NFSROOT. In this model, the rootfs is maintained in an ordinary directory on a shared file system. We are currently porting the custom dracut modules into OpenCHAMI to enable SSI on top of parallel filesystems like Lustre and BeeGFS, with the aim of delivering performance and reliability at HPC scale.
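To make the read-only NFSROOT model concrete, the following minimal sketch assembles the kernel arguments that stock dracut accepts for an NFS network root (`root=nfs:<server>:<path>:<options>`, `ip=dhcp`, `rd.neednet=1`). It is an illustration, not the PoC's actual configuration: the function name, server address, and export path are hypothetical, and the custom dracut modules for Lustre or BeeGFS are not covered.

```python
def build_nfsroot_cmdline(server, export_path, nfs_options=("ro", "nolock")):
    """Assemble kernel arguments for a dracut network-root boot over NFS.

    Hypothetical helper for illustration; server and export_path are
    placeholders, not values taken from the PoC.
    """
    if not export_path.startswith("/"):
        raise ValueError("export_path must be an absolute path")

    # dracut syntax: root=nfs:[<server-ip>:]<root-dir>[:<nfs-options>]
    root_arg = f"root=nfs:{server}:{export_path}"
    if nfs_options:
        root_arg += ":" + ",".join(nfs_options)

    # ip=dhcp and rd.neednet=1 ask dracut to bring up networking in the
    # initramfs; the trailing "ro" asks the kernel to mount root read-only.
    return " ".join([root_arg, "ip=dhcp", "rd.neednet=1", "ro"])


if __name__ == "__main__":
    # Example values are assumptions chosen for the sketch.
    print(build_nfsroot_cmdline("10.0.0.1", "/srv/ssi/rootfs"))
```

Every node that boots with the same arguments mounts the same directory, which is what lets a single change on the server reach the whole cluster.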
Key Benefits of this Approach:
- Zero-Downtime Updates: A change made to the centralized rootfs folder becomes visible to all compute nodes, with no need to rebuild images or reboot nodes.
- Resource Efficiency: This approach has a remarkably small memory footprint on the compute nodes.
- Flexibility & Consistency: Every node reads the same rootfs, which keeps state consistent across all nodes while still allowing flexible cluster configuration and management.
Alternatives Considered
No response
Other Considerations
No response
Related Docs / PRs
Jordi Blasco from Do IT Now implemented this technology in a cluster manager (sNow!), which is no longer under development.
We want to reuse part of this work and integrate it into OpenCHAMI.
More information is available here: https://hpckp.org/talks/modernising-the-hpc-cluster-provisioning/