Skip to content

libibnetdisc: handle incomplete fabrics during discovery and cache load#1767

Open
qvicksilver wants to merge 2 commits into
linux-rdma:masterfrom
qvicksilver:fix-cache-timeout-ports
Open

libibnetdisc: handle incomplete fabrics during discovery and cache load#1767
qvicksilver wants to merge 2 commits into
linux-rdma:masterfrom
qvicksilver:fix-cache-timeout-ports

Conversation

@qvicksilver

Copy link
Copy Markdown
Contributor

A PortInfo timeout can leave libibnetdisc with a partially discovered fabric.
That exposed two issues: chassis grouping could dereference a missing remote
port, and ibnetdiscover --cache could write references to ports that were never
serialized, making the cache fail to load later.

This PR adds the missing remoteport guard, makes cache writing only reference
serialized ports, and makes cache loading tolerate older incomplete caches by
skipping missing port references while keeping bounds checks around cached port
numbers.

Build-tested with a full cmake/ninja build in the fedora:44 AZP container,
using gcc 16.1.1. libibnetdisc compiles cleanly and links.

Runtime-tested on a real fabric where PortInfo queries timed out during
discovery. ibnetdiscover now writes a cache without dangling port references,
and the resulting cache loads successfully.

Commits:
66fcb26 libibnetdisc: guard against NULL remoteport in chassis grouping
8f1756d libibnetdisc: keep cache loadable after port timeouts

fill_voltaire_chassis_record()'s router branch dereferences
port->remoteport->node without checking that the port has a remote port,
unlike the other port loops in this file. An incomplete fabric can
therefore crash during group_nodes().

Add the missing port->remoteport check.

Fixes: a626de1 ("Create a new library libibnetdisc")
Signed-off-by: Jonathan Süssemilch Poulain <jpoulain@coreweave.com>
Copilot AI review requested due to automatic review settings June 26, 2026 09:49

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves libibnetdisc robustness when discovery or cache data is incomplete (e.g., PortInfo timeouts), preventing NULL dereferences during chassis grouping and ensuring cache write/load avoids dangling or missing port references.

Changes:

  • Guard chassis grouping logic against missing remoteport to avoid NULL dereference on partially discovered fabrics.
  • Make cache writing only reference ports that will actually be serialized, and add tolerance when loading older/incomplete caches (skip missing/out-of-range port references).
  • Prevent dangling pointers during cache rebuild by skipping unclaimed port records and nulling unresolved remoteport links.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
libibnetdisc/ibnetdisc_cache.c Makes cache write/load resilient to incomplete port discovery and prevents dangling references during rebuild.
libibnetdisc/chassis.c Adds a NULL guard for remoteport during chassis grouping to avoid dereferences on incomplete fabrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread libibnetdisc/ibnetdisc_cache.c Outdated
Comment thread libibnetdisc/ibnetdisc_cache.c Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread libibnetdisc/ibnetdisc_cache.c
Comment thread libibnetdisc/ibnetdisc_cache.c
ibnetdiscover --cache writes each node's port references and each
port's remote-port reference from the in-memory fabric. If a PortInfo
query times out during discovery, the port can remain as a shell in
node->ports[] or as a remoteport without ever being inserted into the
fabric port table. Such a port has no cache port record, but references
to it were still written. ibnd_load_fabric() then rejected the cache with
"Cache invalid: cannot find port".

Make the cache writer emit only references to ports that are actually
serialized. A port is serialized only when it is reachable from its
owning node, and the same predicate is used for node port references,
remote-port references, and port records. References are matched against
a sorted set of serialized ports, avoiding repeated port-table walks on
large fabrics. Keep nodes with no ports array as zero-port node records
so fabric->from_node remains resolvable.

Also tolerate old incomplete caches when reading: missing node port keys
and missing remote ports are skipped instead of aborting the load. While
doing that, keep malformed caches from creating unsafe fabric state by
bounds-checking cached port numbers before indexing node->ports[] (an
out-of-bounds write) and by skipping unclaimed port records before they
are linked into the fabric and then freed (a use-after-free).

Fixes: 01b8045 ("add libibnetdisc caching to libibnetdiscover")
Signed-off-by: Jonathan Süssemilch Poulain <jpoulain@coreweave.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants