libibnetdisc: handle incomplete fabrics during discovery and cache load#1767
Open
qvicksilver wants to merge 2 commits into
Open
libibnetdisc: handle incomplete fabrics during discovery and cache load#1767qvicksilver wants to merge 2 commits into
qvicksilver wants to merge 2 commits into
Conversation
fill_voltaire_chassis_record()'s router branch dereferences port->remoteport->node without checking that the port has a remote port, unlike the other port loops in this file. An incomplete fabric can therefore crash during group_nodes(). Add the missing port->remoteport check. Fixes: a626de1 ("Create a new library libibnetdisc") Signed-off-by: Jonathan Süssemilch Poulain <jpoulain@coreweave.com>
There was a problem hiding this comment.
Pull request overview
This PR improves libibnetdisc robustness when discovery or cache data is incomplete (e.g., PortInfo timeouts), preventing NULL dereferences during chassis grouping and ensuring cache write/load avoids dangling or missing port references.
Changes:
- Guard chassis grouping logic against missing
remoteportto avoid NULL dereference on partially discovered fabrics. - Make cache writing only reference ports that will actually be serialized, and add tolerance when loading older/incomplete caches (skip missing/out-of-range port references).
- Prevent dangling pointers during cache rebuild by skipping unclaimed port records and nulling unresolved
remoteportlinks.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| libibnetdisc/ibnetdisc_cache.c | Makes cache write/load resilient to incomplete port discovery and prevents dangling references during rebuild. |
| libibnetdisc/chassis.c | Adds a NULL guard for remoteport during chassis grouping to avoid dereferences on incomplete fabrics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
8f1756d to
4198ad7
Compare
ibnetdiscover --cache writes each node's port references and each port's remote-port reference from the in-memory fabric. If a PortInfo query times out during discovery, the port can remain as a shell in node->ports[] or as a remoteport without ever being inserted into the fabric port table. Such a port has no cache port record, but references to it were still written. ibnd_load_fabric() then rejected the cache with "Cache invalid: cannot find port". Make the cache writer emit only references to ports that are actually serialized. A port is serialized only when it is reachable from its owning node, and the same predicate is used for node port references, remote-port references, and port records. References are matched against a sorted set of serialized ports, avoiding repeated port-table walks on large fabrics. Keep nodes with no ports array as zero-port node records so fabric->from_node remains resolvable. Also tolerate old incomplete caches when reading: missing node port keys and missing remote ports are skipped instead of aborting the load. While doing that, keep malformed caches from creating unsafe fabric state by bounds-checking cached port numbers before indexing node->ports[] (an out-of-bounds write) and by skipping unclaimed port records before they are linked into the fabric and then freed (a use-after-free). Fixes: 01b8045 ("add libibnetdisc caching to libibnetdiscover") Signed-off-by: Jonathan Süssemilch Poulain <jpoulain@coreweave.com>
4198ad7 to
de83d1c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A PortInfo timeout can leave libibnetdisc with a partially discovered fabric.
That exposed two issues: chassis grouping could dereference a missing remote
port, and ibnetdiscover --cache could write references to ports that were never
serialized, making the cache fail to load later.
This PR adds the missing remoteport guard, makes cache writing only reference
serialized ports, and makes cache loading tolerate older incomplete caches by
skipping missing port references while keeping bounds checks around cached port
numbers.
Build-tested with a full cmake/ninja build in the fedora:44 AZP container,
using gcc 16.1.1. libibnetdisc compiles cleanly and links.
Runtime-tested on a real fabric where PortInfo queries timed out during
discovery. ibnetdiscover now writes a cache without dangling port references,
and the resulting cache loads successfully.
Commits:
66fcb26 libibnetdisc: guard against NULL remoteport in chassis grouping
8f1756d libibnetdisc: keep cache loadable after port timeouts