[sled-agent] route SIGCHLD to dedicated signal-handling thread #9982
This is necessary to pick up oxidecomputer/oxide-tokio-rt#4, which adds the API for configuring a dedicated signal-handling thread outside the Tokio runtime. That API uses the latest version of `nix`, which is 0.31. We presently depend on 0.30. The `nix` update is easy for us as the only breaking change here is the removal of `Eq` and `PartialEq` implementations for `SigHandler`, which we are not using; see [their changelog][1].

[1]: https://github.com/nix-rust/nix/blob/bf1d0e9707189422f546e398594fa1a51a772d9d/CHANGELOG.md#0310---2026-01-22
As described in #9849, receiving a handled Unix signal on an application thread interrupts whatever that thread is presently doing, which interferes with IPCC communication (see oxidecomputer/stlouis#922). Since `sled-agent` tends to spawn a large number of child processes, many of which are short-lived, and it uses `tokio::process` to manage these children, we receive a lot of `SIGCHLD`s when our child processes exit. These get delivered to an arbitrary thread in the process, which can mess up IPCC stuff. Therefore, use the new `OxideBuilder::signal_thread` API added in oxidecomputer/oxide-tokio-rt#4 to set up a dedicated thread outside the runtime, which will block in `sigsuspend` in a loop, and mask out `SIGCHLD` on all other threads in the process.
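For readers unfamiliar with per-thread signal masks, here is a minimal sketch of the masking half of this approach. It is not how `oxide-tokio-rt` or `nix` implement it. It assumes the mask is set by blocking `SIGCHLD` before any other thread is spawned, so that later threads inherit the mask, and that only the dedicated thread unblocks it. The helper names (`change_sigchld_mask`, `sigchld_blocked_here`, `demo_masks`) are hypothetical, the libc functions are declared by hand to keep the example dependency-free, and the numeric constants are hard-coded assumptions for a few platforms. The `sigsuspend` loop and the handler that Tokio installs are left out.

```rust
use std::os::raw::c_int;
use std::thread;

// Hard-coded platform constants: assumptions for illustration only.
// Real code would take these from the `libc` or `nix` crates.
#[cfg(target_os = "linux")]
mod consts {
    // Values for x86_64, aarch64 and most Linux targets (not MIPS or SPARC).
    pub const SIG_BLOCK: i32 = 0;
    pub const SIG_UNBLOCK: i32 = 1;
    pub const SIGCHLD: i32 = 17;
}
#[cfg(target_os = "illumos")]
mod consts {
    pub const SIG_BLOCK: i32 = 1;
    pub const SIG_UNBLOCK: i32 = 2;
    pub const SIGCHLD: i32 = 18;
}
#[cfg(target_os = "macos")]
mod consts {
    pub const SIG_BLOCK: i32 = 1;
    pub const SIG_UNBLOCK: i32 = 2;
    pub const SIGCHLD: i32 = 20;
}
use consts::*;

/// Opaque, over-sized stand-in for the C `sigset_t`
/// (128 bytes on glibc and musl, smaller on illumos and macOS).
#[repr(C)]
pub struct SigSet([u64; 16]);

unsafe extern "C" {
    fn sigemptyset(set: *mut SigSet) -> c_int;
    fn sigaddset(set: *mut SigSet, signum: c_int) -> c_int;
    fn sigismember(set: *const SigSet, signum: c_int) -> c_int;
    fn pthread_sigmask(how: c_int, set: *const SigSet, oldset: *mut SigSet) -> c_int;
}

/// Block or unblock SIGCHLD on the calling thread only.
fn change_sigchld_mask(how: c_int) {
    let mut set = SigSet([0; 16]);
    // SAFETY: `set` is a writable buffer at least as large as `sigset_t`.
    let rc = unsafe {
        sigemptyset(&mut set);
        sigaddset(&mut set, SIGCHLD);
        pthread_sigmask(how, &set, std::ptr::null_mut())
    };
    assert_eq!(rc, 0, "pthread_sigmask failed");
}

/// Report whether SIGCHLD is currently blocked on the calling thread.
pub fn sigchld_blocked_here() -> bool {
    let mut current = SigSet([0; 16]);
    // SAFETY: with a null `set`, `how` is ignored and the current mask is
    // written to `current`, which is large enough to hold it.
    unsafe {
        let rc = pthread_sigmask(SIG_BLOCK, std::ptr::null(), &mut current);
        assert_eq!(rc, 0, "pthread_sigmask failed");
        sigismember(&current, SIGCHLD) == 1
    }
}

/// Returns `(worker_blocked, signal_thread_blocked)`.
pub fn demo_masks() -> (bool, bool) {
    // Block first: threads spawned afterwards inherit the creator's mask.
    change_sigchld_mask(SIG_BLOCK);
    // An ordinary worker thread never touches its mask.
    let worker = thread::spawn(sigchld_blocked_here);
    // The dedicated thread opts back in; a real one would then wait for
    // signals (the commit message describes a `sigsuspend` loop).
    let signal_thread = thread::spawn(|| {
        change_sigchld_mask(SIG_UNBLOCK);
        sigchld_blocked_here()
    });
    (worker.join().unwrap(), signal_thread.join().unwrap())
}
```

Calling `demo_masks()` from a program's main thread should report that the worker still has `SIGCHLD` blocked while the dedicated thread does not. Threads that already exist when the mask is first set are unaffected, which is why the blocking has to happen early in process startup.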
@dancrossnyc, @jgallagher et al., I don't suppose there's a procedure to reproduce the IPCC issues? I'd like to be able to test that this change actually resolves the problem, if possible.
The variant I've been using is available as a pull request on John's reproducer repo: But that's not testing against omicron proper. To probe the effect of the oxide-tokio-rt change within omicron, I'd probably instrument it to spawn a tokio task that loops and, at some frequency (say, 20 Hz), forks and immediately exits in the child, then see whether it impacts IPCC flows by looking at the kernel IPCC debug message buffer; if an IPCC send or receive has been interrupted by signal delivery, you should see a notice.
As described in #9849, receiving a handled Unix signal on an application thread interrupts whatever that thread is presently doing, which interferes with IPCC communication (see oxidecomputer/stlouis#922). Since `sled-agent` tends to spawn a large number of child processes, many of which are short-lived, and it uses `tokio::process` to manage these children, we receive a lot of `SIGCHLD`s when our child processes exit. These get delivered to an arbitrary thread in the process, which can mess up IPCC stuff.

Therefore, this branch changes `sled-agent` to use the new `OxideBuilder::signal_thread` API added in oxidecomputer/oxide-tokio-rt#4 to set up a dedicated thread outside the runtime, which will block in `sigsuspend` in a loop, and mask out `SIGCHLD` on all other threads in the process, ensuring that receiving a `SIGCHLD` doesn't interfere with other operations.

This required updating `oxide-tokio-rt` to 0.1.3 in order to pick up the new API. This in turn also required updating our dependency on the `nix` crate from 0.30 to 0.31, as I used the latest version in `oxide-tokio-rt` and it's part of the public API for the new feature. The `nix` update is easy for us as the only breaking change here is the removal of `Eq` and `PartialEq` implementations for `SigHandler`, which we are not using; see their changelog.

Closes #9849