[sled-agent] route SIGCHLD to dedicated signal-handling thread #9982

Open
hawkw wants to merge 3 commits into main from eliza/sled-agent-signal-thread

hawkw commented Mar 5, 2026

As described in #9849, receiving a handled Unix signal on an application
thread interrupts whatever that thread is presently doing, which
interferes with IPCC communication (see oxidecomputer/stlouis#922).
Since sled-agent tends to spawn a large number of child processes,
many of which are short-lived, and it uses tokio::process to manage
these children, we receive a lot of SIGCHLDs when our child processes
exit. These can be delivered to any thread in the process, which
can interrupt in-flight IPCC operations.

Therefore, this branch changes sled-agent to use the new
OxideBuilder::signal_thread API added in
oxidecomputer/oxide-tokio-rt#4 to set up a dedicated thread outside the
runtime, which will block in sigsuspend in a loop, and mask out
SIGCHLD on all other threads in the process, ensuring that receiving
a SIGCHLD doesn't interfere with other operations.
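
The masking half of this scheme relies on new threads inheriting the signal mask of the thread that spawns them. Here is a minimal sketch of that idea only; it is not the `oxide-tokio-rt` implementation. It declares the libc functions by hand and assumes Linux values for `SIG_BLOCK`, `SIG_UNBLOCK`, and `SIGCHLD` and a 128-byte `sigset_t` (real code would take these from `nix` or `libc`). All helper names are hypothetical.

```rust
use std::os::raw::c_int;
use std::thread;

// Assumption: Linux (x86_64/aarch64) constants; other platforms differ.
const SIG_BLOCK: c_int = 0;
const SIG_UNBLOCK: c_int = 1;
const SIGCHLD: c_int = 17;

/// Stand-in for libc's `sigset_t`, assumed to be 128 bytes as on Linux.
#[repr(C)]
pub struct SigSet([u64; 16]);

unsafe extern "C" {
    fn sigemptyset(set: *mut SigSet) -> c_int;
    fn sigaddset(set: *mut SigSet, signum: c_int) -> c_int;
    fn sigismember(set: *const SigSet, signum: c_int) -> c_int;
    fn pthread_sigmask(how: c_int, set: *const SigSet, old: *mut SigSet) -> c_int;
}

/// Applies `how` (block or unblock) to SIGCHLD for the calling thread only.
fn change_sigchld(how: c_int) {
    let mut set = SigSet([0; 16]);
    unsafe {
        assert_eq!(sigemptyset(&mut set), 0);
        assert_eq!(sigaddset(&mut set, SIGCHLD), 0);
        assert_eq!(pthread_sigmask(how, &set, std::ptr::null_mut()), 0);
    }
}

pub fn block_sigchld() {
    change_sigchld(SIG_BLOCK)
}

pub fn unblock_sigchld() {
    change_sigchld(SIG_UNBLOCK)
}

/// Reports whether SIGCHLD is blocked on the calling thread.
pub fn sigchld_blocked() -> bool {
    let mut current = SigSet([0; 16]);
    unsafe {
        // A null `set` leaves the mask unchanged and only reports it.
        assert_eq!(pthread_sigmask(SIG_BLOCK, std::ptr::null(), &mut current), 0);
        sigismember(&current, SIGCHLD) == 1
    }
}

/// Returns (worker_blocked, signal_thread_blocked).
pub fn demo_mask_inheritance() -> (bool, bool) {
    // Block before spawning: every thread created afterwards inherits the mask.
    block_sigchld();
    let worker = thread::spawn(sigchld_blocked);
    let signal_thread = thread::spawn(|| {
        // Only the dedicated thread unblocks SIGCHLD; a real one would then
        // wait for signals in a loop (the PR describes `sigsuspend`).
        unblock_sigchld();
        sigchld_blocked()
    });
    (worker.join().unwrap(), signal_thread.join().unwrap())
}
```

Calling `demo_mask_inheritance()` early in `main`, before any other threads exist, shows an ordinary worker thread with SIGCHLD blocked while the dedicated thread has it unblocked.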

This required updating oxide-tokio-rt to 0.1.3 in order to pick up the
new API. This in turn also required updating our dependency on the nix
crate from 0.30 to 0.31, as I used the latest version in
oxide-tokio-rt and it's part of the public API for the new feature.
The nix update is easy for us as the only breaking change here is the
removal of Eq and PartialEq implementations for SigHandler, which
we are not using --- see their changelog.

Closes #9849

hawkw added 2 commits March 5, 2026 09:42
This is necessary to pick up oxidecomputer/oxide-tokio-rt#4, which adds
the API for configuring a dedicated signal-handling thread outside the
Tokio runtime. That API uses the latest version of `nix`, which is 0.31.
We presently depend on 0.30. The `nix` update is easy for us as the only
breaking change here is the removal of `Eq` and `PartialEq`
implementations for `SigHandler`, which we are not using --- see [their
changelog][1].

[1]: https://github.com/nix-rust/nix/blob/bf1d0e9707189422f546e398594fa1a51a772d9d/CHANGELOG.md#0310---2026-01-22

As described in #9849, receiving a handled Unix signal on an application
thread interrupts whatever that thread is presently doing, which
interferes with IPCC communication (see oxidecomputer/stlouis#922).
Since `sled-agent` tends to spawn a large number of child processes,
many of which are short-lived, and it uses `tokio::process` to manage
these children, we receive a lot of `SIGCHLD`s when our child processes
exit. These can be delivered to any thread in the process, which
can interrupt in-flight IPCC operations. Therefore, use the new
`OxideBuilder::signal_thread` API added in
oxidecomputer/oxide-tokio-rt#4 to set up a dedicated thread outside the
runtime, which will block in `sigsuspend` in a loop, and mask out
`SIGCHLD` on all other threads in the process.

hawkw commented Mar 5, 2026

@dancrossnyc, @jgallagher et al., I don't suppose there's a procedure to reproduce the IPCC issues? I'd like to be able to test that this change actually resolves the problem, if possible.

@dancrossnyc
Contributor

> @dancrossnyc, @jgallagher et al., I don't suppose there's a procedure to reproduce the IPCC issues? I'd like to be able to test that this change actually resolves the problem, if possible.

The variant I've been using is available as a pull request on John's reproducer repo:
oxidecomputer/john-ipcc-signals#1

But that's not testing against omicron proper.

To probe the effect of the oxide-tokio-rt change within omicron, I'd probably instrument it to spawn a tokio task that loops and, at some frequency (say, 20 Hz), forks and immediately exits in the child, then see whether it impacts IPCC flows by looking at the kernel IPCC debug message buffer; if an IPCC send and/or receive has been interrupted by signal delivery, you should see a notice.
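
A rough sketch of the child-churn half of that probe, under several assumptions: it uses a plain loop with `std::process::Command` running `true` rather than a tokio task and a raw `fork`, and the function name and parameters are hypothetical. Inside sled-agent one would presumably spawn through `tokio::process` instead, so that the runtime's own SIGCHLD handling is exercised.

```rust
use std::process::Command;
use std::thread;
use std::time::{Duration, Instant};

/// Spawns `count` short-lived children at roughly `hz` per second and
/// returns how many of them exited successfully. `hz` must be non-zero.
pub fn churn_children(hz: u32, count: u32) -> u32 {
    let period = Duration::from_secs(1) / hz;
    let mut succeeded = 0;
    for _ in 0..count {
        let start = Instant::now();
        // Assumption: a `true` binary is on PATH. Each child exit generates
        // a SIGCHLD for this process.
        let status = Command::new("true")
            .status()
            .expect("failed to spawn `true`");
        if status.success() {
            succeeded += 1;
        }
        // Sleep for whatever is left of this period, if anything.
        if let Some(rest) = period.checked_sub(start.elapsed()) {
            thread::sleep(rest);
        }
    }
    succeeded
}
```

Running something like `churn_children(20, 200)` on a background thread while IPCC traffic is flowing, and then checking the kernel IPCC debug message buffer as described above, would approximate the suggested experiment.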
