ExecutionEngineException fail-fast in GC thread-suspension / return-address hijack while root-scanning a Win32 message-pump P/Invoke frame (.NET 10.0.8, Windows, Avalonia desktop)

## Description

A long-running desktop app (Avalonia 12, Win32 windowing backend) reliably fail-fasts with a runtime-raised `System.ExecutionEngineException` (HRESULT `0x80131506`, faulting module `coreclr.dll`) every ~8–80 minutes of idle/normal use. The exception is raised by the runtime itself from `Debugger::HandleFatalError` → `RaiseFailFastException`, so it bypasses `AppDomain.UnhandledException` and cannot be caught.

Using `DOTNET_StressLog=1` plus a full minidump (`DOTNET_DbgEnableMiniDump=1`, `DOTNET_DbgMiniDumpType=4`), I captured the runtime's own fatal-event trail across **three independent crashes**. All three show the **same mechanism**: the GC's thread-suspension / return-address-hijack machinery raises the fatal error while suspending a thread stopped at a **non-fully-interruptible JIT point** (`fullyInt=0`), while the **main UI thread is being GC-root-scanned inside the Win32 message pump's `DispatchMessage(MSG)` P/Invoke frame**, during a routine (not pressure-driven) GC.

**Disabling CET (`<CetCompat>false</CetCompat>`) stops the crash completely** (confirmed by a 2+ hour soak — see below), which strongly implicates CET hardware shadow-stack enforcement colliding with the GC's return-address hijack.

## Frequency / Severity

- Reproduces every ~8–80 minutes of runtime. Fatal (process dies); no managed handler can intercept.
- Reproduced across **two different builds** of the app and **two render modes** (GPU/ANGLE and CPU/software) — 3 captured dumps total.

## StressLog — the fatal sequence (representative; identical shape in all 3 dumps)

```
<tid> <ts> : SYNC   Stopped in Jitted code at pc=...  sp=...  fullyInt=0
<tid> <ts> : SYNC   Hijacking return address ...  for thread ...
<tid> <ts> : CORDB  D::HFE: About to call LogFatalError            <-- FATAL
<tid> <ts> : EH     SetLastThrownObject: obj=... (ExecutionEngineException)
<tid> <ts> : CORDB  D::RFFE: About to call RaiseFailFastException  <-- FAIL-FAST
<tid> <ts> : SYNC   Thread::SuspendAllThreads() - Success
```

- Exactly one fatal event per crash. Thrown object is the `System.ExecutionEngineException` singleton (same MethodTable across all 3 crashes), `_message=null`, `HResult=0x80131506`, exception code `0xE0434352` → runtime-raised, not a managed `throw`.
- At every `SuspendAllThreads` (including the fatal one), the main UI thread shows `Scanning Frameless method IL_STUB_PInvoke(MSG ByRef)` — i.e. it is parked in the Win32 `GetMessage`/`DispatchMessage(MSG)` P/Invoke and being root-scanned.
- The thread being **hijacked** at the fatal moment varies by crash and is doing ordinary work:
  - Crash A: main UI thread mid-`Measure` (layout).
  - Crash B: a non-main thread.
  - Crash C: a thread inside `System.DateTimeOffset.ValidateOffset` / `..ctor` — **pure CoreLib, no app/render frames** (a periodic timer's timestamp path).
- The GC at the fatal point is routine: e.g. Crash C was `BEGINGC ... requested generation = 1` — an ordinary periodic gen-1, ~1 GC / 4 s, ~35 MB heap.
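
For anyone triaging their own dumps, the fatal shape above can be located mechanically. The following is a minimal sketch, assuming the StressLog has already been exported to text and that its lines contain the same message fragments as the excerpt (`Stopped in Jitted code ... fullyInt=N`, `Hijacking return address`, `D::HFE`). The function name `find_fatal_hijacks` and the `window` parameter are hypothetical, introduced only for illustration.

```python
import re

# Matches the "Stopped in Jitted code" message and captures the fullyInt flag.
STOPPED = re.compile(r"Stopped in Jitted code at pc=\S+\s+sp=\S+\s+fullyInt=(\d)")


def find_fatal_hijacks(lines, window=3):
    """Return (hijack_line_index, fullyInt) for each hijack that is followed
    by a D::HFE (HandleFatalError) entry within `window` lines.

    `fullyInt` is taken from the most recent "Stopped in Jitted code" line
    seen before the hijack, or None if no such line was seen.
    """
    hits = []
    last_fully_int = None   # fullyInt of the latest "Stopped in Jitted code"
    hijack_at = None        # index of the latest unmatched hijack line
    hijack_fully_int = None
    for i, line in enumerate(lines):
        m = STOPPED.search(line)
        if m:
            last_fully_int = int(m.group(1))
        elif "Hijacking return address" in line:
            hijack_at = i
            hijack_fully_int = last_fully_int
        elif "D::HFE" in line and hijack_at is not None:
            if i - hijack_at <= window:
                hits.append((hijack_at, hijack_fully_int))
            hijack_at = None
    return hits
```

Feeding it the lines of an exported log (for example `find_fatal_hijacks(open(path, errors="replace").read().splitlines())`) should yield one hit per crash if the log has the shape described here; a hijack followed only by `SuspendAllThreads() - Success` is not reported.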

## Minimal characterization of the invariant

> During a routine GC, while the main UI thread is suspended inside the Win32 `DispatchMessage(MSG)` P/Invoke frame being root-scanned, the runtime hijacks the return address of *another* thread stopped at a `fullyInt=0` JIT point and immediately enters `Debugger::HandleFatalError` → raises `ExecutionEngineException` → `RaiseFailFastException`.

The hijack-victim thread's work is incidental (layout in one crash, `DateTimeOffset` ctor in another) — the constant is the **suspend/hijack step itself faulting**, with the pump P/Invoke frame present in the root scan.

## Disabling CET stops it — confirmed

- Building the app with **`<CetCompat>false</CetCompat>`** (opts the exe out of CET shadow stacks; apphost PE `CET_COMPAT` bit verified flipped 1→0 via `dumpbin /headers`) **eliminates the crash**: the CET-off build ran **125+ minutes with zero `ExecutionEngineException` (EEE) fail-fasts** and was still healthy, versus an 8–80 minute crash baseline on every CET-on build (3 dumps, 2 binaries, GPU and software render).
- This points at CET hardware shadow-stack enforcement as the trigger: a GC return-address hijack sets a thread's return address to a stub that is not on the shadow stack, which CET treats as a control-flow violation and terminates the process. The runtime has a CET-safe suspension path using special user-mode APCs (`QueueUserAPC2`); this looks like a case where the hijack path is taken while shadow stacks are active.
- `<CetCompat>false</CetCompat>` is a usable mitigation, but it disables a security feature for the whole process; an in-runtime fix (always take the CET-safe suspension path when shadow stacks are enabled) would be preferable.
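
The `CET_COMPAT` check can also be scripted when `dumpbin` is not available. This is a minimal sketch, not the method used for this report. It assumes the bit is recorded as documented in the PE format: a debug-directory entry of type `IMAGE_DEBUG_TYPE_EX_DLLCHARACTERISTICS` (20) whose data is a DWORD with `IMAGE_DLLCHARACTERISTICS_EX_CET_COMPAT` (0x0001). The function name `cet_compat_flag` is hypothetical.

```python
import struct

IMAGE_DEBUG_TYPE_EX_DLLCHARACTERISTICS = 20
IMAGE_DLLCHARACTERISTICS_EX_CET_COMPAT = 0x0001
DEBUG_DIRECTORY_INDEX = 6      # index of the debug entry in the data directories
DEBUG_ENTRY_SIZE = 28          # sizeof(IMAGE_DEBUG_DIRECTORY)


def cet_compat_flag(image: bytes):
    """Return True/False for the CET_COMPAT bit of a PE image, or None if the
    image carries no extended-DLL-characteristics debug entry."""
    pe_off = struct.unpack_from("<I", image, 0x3C)[0]          # e_lfanew
    if image[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("not a PE image")
    # COFF header: NumberOfSections at +6, SizeOfOptionalHeader at +20.
    num_sections, opt_size = struct.unpack_from("<H12xH", image, pe_off + 6)
    opt_off = pe_off + 24
    magic = struct.unpack_from("<H", image, opt_off)[0]
    if magic == 0x20B:         # PE32+
        dirs_off = opt_off + 112
    elif magic == 0x10B:       # PE32
        dirs_off = opt_off + 96
    else:
        raise ValueError("unknown optional-header magic")
    dir_count = struct.unpack_from("<I", image, dirs_off - 4)[0]
    if dir_count <= DEBUG_DIRECTORY_INDEX:
        return None
    dbg_rva, dbg_size = struct.unpack_from(
        "<II", image, dirs_off + 8 * DEBUG_DIRECTORY_INDEX)
    if dbg_rva == 0:
        return None
    # Map the debug directory's RVA to a file offset via the section table.
    sec_off = opt_off + opt_size
    dbg_off = None
    for i in range(num_sections):
        vsize, vaddr, rsize, rptr = struct.unpack_from(
            "<IIII", image, sec_off + 40 * i + 8)
        if vaddr <= dbg_rva < vaddr + max(vsize, rsize):
            dbg_off = dbg_rva - vaddr + rptr
            break
    if dbg_off is None:
        return None
    # Walk the IMAGE_DEBUG_DIRECTORY entries looking for type 20.
    for off in range(dbg_off, dbg_off + dbg_size - DEBUG_ENTRY_SIZE + 1,
                     DEBUG_ENTRY_SIZE):
        entry_type, data_size, _data_rva, data_ptr = struct.unpack_from(
            "<IIII", image, off + 12)
        if entry_type == IMAGE_DEBUG_TYPE_EX_DLLCHARACTERISTICS and data_size >= 4:
            flags = struct.unpack_from("<I", image, data_ptr)[0]
            return bool(flags & IMAGE_DLLCHARACTERISTICS_EX_CET_COMPAT)
    return None
```

Called as `cet_compat_flag(open(apphost_path, "rb").read())`, where `apphost_path` is a placeholder for the built exe, a CET-on build would be expected to return `True` and a `<CetCompat>false</CetCompat>` build `False` or `None`.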

## Ruled out (with evidence)

1. **GPU / Skia / ANGLE render path — exonerated.** Forced CPU software rendering (no `av_libGLESv2.dll` / `av_libEGL.dll` / `nvwgf2umx.dll` / `d3d11.dll` / `dxgi.dll` / `opengl32.dll` / `vulkan-1.dll` loaded — verified by module-list diff vs the GPU-mode dumps) still crashed with the identical mechanism in ~8 min, on a thread doing `DateTimeOffset` construction with zero render frames.
2. **Allocation pressure — exonerated.** Fatal GCs are routine periodic gen-0/gen-1 at ~1/4 s with a ~35 MB heap; a low-allocation run crashed *faster* than a high-allocation one. No GC heap corruption (`verifyheap` clean).
3. **Application exception storm — exonerated.** An earlier high-rate `InvalidDataException` storm (an app-side deserialization bug, unrelated to the runtime) was fixed; with it gone (`InvalidDataException` = 0 across a 273 MB StressLog) the EEE still reproduced. Background exception rate at crash is ~1/8 s, not a storm.

## Environment

- .NET: **10.0.8** (DAC 10.0.826.23019)
- OS: **Windows 11, build 26200** (x64)
- UI framework: **Avalonia 12**, Win32 windowing backend, classic desktop lifetime, single-threaded WPF-style UI message pump
- `TieredCompilation` already disabled; reproduces with and without tiering knobs.
- App is mostly idle at crash time (periodic timers + UI message pump + routine GC).

## Possibly related

- **#112598** — WinForms + WebView2 (Win32 message pump + native interop), random `0x80131506` / `coreclr.dll` / `0xc0000005`, 1–10 min after launch, worse with more RAM, escalated after Windows 11 24H2. Open, `tracking-external-issue`, no working workaround. Closest match; the 24H2 correlation implicates the OS CET/shadow-stack/APC change.
- **#108118** — ElementHost / WinForms sibling crash (same family).
- **#83437** — the runtime assertion `!AreCetShadowStacksEnabled() || UseSpecialUserModeApc()` (encodes "shadow stacks on ⇒ must be on the special-APC suspension path").
- dotnet/core **#6733** — `QueueUserAPC2` faults on older CPUs (the CET-safe APC path's own failure mode).

## Repro assets available on request

- 3 full minidumps (Type=4) with armed StressLog (`DOTNET_StressLog=1`, 128 MB/thread ring); the two storm-free dumps are cleanest (~273 MB each, can be trimmed to the fatal window).
- `clrstack` / `clrthreads` / `threads` / module-list captures per dump.

## Questions for the runtime team

1. Is this a known issue in the **return-address hijack / `EEPolicy::HandleFatalError`** path on Windows when a thread is suspended at a `fullyInt=0` point while another thread's `IL_STUB_PInvoke(MSG ByRef)` frame is being root-scanned?
2. **CET:** when shadow stacks are enabled, under what conditions can the GC take the **hijack** suspension path (return-address overwrite) instead of the CET-safe `QueueUserAPC2` path? We've confirmed `<CetCompat>false</CetCompat>` stops the crash. Is that the recommended mitigation on 10.0.8, or is an in-runtime fix planned?
3. Do any suspension-timing knobs change time-to-crash as a race corroborator: `DOTNET_gcServer=1`, `DOTNET_GCgen0size=<larger>`, `DOTNET_TieredPGO=0`, `DOTNET_JITMinOpts=1`? (`DOTNET_ThreadSuspendInjection` is Unix-only and does not apply on Windows.)
4. Is there a recommended mitigation for desktop apps on 10.0.8 short of pinning an earlier runtime, other than opting out of CET?
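
If the knobs in question 3 are worth testing, the A/B runs could be driven from a small harness. The sketch below only builds the per-run environments, using the variable names listed in this report; `knob_matrix`, `build_env`, and the run labels are hypothetical names, and the `DOTNET_GCgen0size` value is left to the caller because no specific value has been chosen.

```python
import os

# Capture settings used for the dumps in this report.
CAPTURE_ENV = {
    "DOTNET_StressLog": "1",
    "DOTNET_DbgEnableMiniDump": "1",
    "DOTNET_DbgMiniDumpType": "4",
}


def knob_matrix(gen0size):
    """One entry per run; `gen0size` is the caller-chosen value for
    DOTNET_GCgen0size (this report does not fix one)."""
    return {
        "baseline": {},
        "server-gc": {"DOTNET_gcServer": "1"},
        "larger-gen0": {"DOTNET_GCgen0size": gen0size},
        "no-tiered-pgo": {"DOTNET_TieredPGO": "0"},
        "jit-minopts": {"DOTNET_JITMinOpts": "1"},
    }


def build_env(knobs, base=None):
    """Merge the capture settings and one run's knobs over a base
    environment (defaults to a copy of the current process environment)."""
    env = dict(os.environ if base is None else base)
    env.update(CAPTURE_ENV)
    env.update(knobs)
    return env
```

Each resulting dictionary can be passed as `env=` to `subprocess.Popen` when launching the app, with time-to-crash recorded per label.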

---

_Suggested areas: `area-VM-coreclr`, `area-GC-coreclr`._

