ExecutionEngineException fail-fast in GC thread-suspension / return-address hijack while root-scanning a Win32 message-pump P/Invoke frame (.NET 10.0.8, Windows, Avalonia desktop) #129071

@steveh-iz

Description

A long-running desktop app (Avalonia 12, Win32 windowing backend) reliably fail-fasts with a runtime-raised System.ExecutionEngineException (HRESULT 0x80131506, faulting module coreclr.dll) every ~8–80 minutes of idle/normal use. The exception is raised by the runtime itself from Debugger::HandleFatalError → RaiseFailFastException, so it bypasses AppDomain.UnhandledException and cannot be caught.

Using DOTNET_StressLog=1 + a full minidump (DOTNET_DbgEnableMiniDump=1, DbgMiniDumpType=4) I captured the runtime's own fatal-event trail across three independent crashes. All three show the same mechanism: the GC's thread-suspension / return-address-hijack machinery raises the fatal error while suspending a thread stopped at a non-fully-interruptible JIT point (fullyInt=0), while the main UI thread is being GC-root-scanned inside the Win32 message pump's DispatchMessage(MSG) P/Invoke frame, during a routine (not pressure-driven) GC.

Disabling CET (<CetCompat>false</CetCompat>) stops the crash completely (confirmed by a 2+ hour soak — see below), which strongly implicates CET hardware shadow-stack enforcement colliding with the GC's return-address hijack.

Frequency / Severity

  • Reproduces every ~8–80 minutes of runtime. Fatal (process dies); no managed handler can intercept.
  • Reproduced across two different builds of the app and two render modes (GPU/ANGLE and CPU/software) — 3 captured dumps total.

StressLog — the fatal sequence (representative; identical shape in all 3 dumps)

<tid> <ts> : SYNC   Stopped in Jitted code at pc=...  sp=...  fullyInt=0
<tid> <ts> : SYNC   Hijacking return address ...  for thread ...
<tid> <ts> : CORDB  D::HFE: About to call LogFatalError            <-- FATAL
<tid> <ts> : EH     SetLastThrownObject: obj=... (ExecutionEngineException)
<tid> <ts> : CORDB  D::RFFE: About to call RaiseFailFastException  <-- FAIL-FAST
<tid> <ts> : SYNC   Thread::SuspendAllThreads() - Success
  • Exactly one fatal event per crash. Thrown object is the System.ExecutionEngineException singleton (same MethodTable across all 3 crashes), _message=null, HResult=0x80131506, exception code 0xE0434352 → runtime-raised, not a managed throw.
  • At every SuspendAllThreads (including the fatal one), the main UI thread shows Scanning Frameless method IL_STUB_PInvoke(MSG ByRef) — i.e. it is parked in the Win32 GetMessage/DispatchMessage(MSG) P/Invoke and being root-scanned.
  • The thread being hijacked at the fatal moment varies by crash and is doing ordinary work:
    • Crash A: main UI thread mid-Measure (layout).
    • Crash B: a non-main thread.
    • Crash C: a thread inside System.DateTimeOffset.ValidateOffset / ..ctor: pure CoreLib, no app/render frames (a periodic timer's timestamp path).
  • The GC at the fatal point is routine: e.g. Crash C was BEGINGC ... requested generation = 1 — an ordinary periodic gen-1, ~1 GC / 4 s, ~35 MB heap.
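
For anyone triaging similar dumps, here is a minimal sketch of how the fatal window could be located in a text export of the StressLog. It assumes the export contains one event per line with the same message fragments as the representative excerpt above; the column layout, the function name find_fatal_events, and the window size are illustrative assumptions, not part of any runtime tooling.

```python
from collections import deque

# Message fragments copied from the representative excerpt above.
# Assumption: a text export of the StressLog carries these strings verbatim.
STOPPED = "Stopped in Jitted code"
HIJACK = "Hijacking return address"
FATAL = "D::HFE: About to call LogFatalError"


def find_fatal_events(lines, window=8):
    """Report each fatal marker together with what preceded it.

    For every line containing the fatal marker, look back at up to
    `window` earlier lines and record whether a hijack event and a
    non-fully-interruptible stop (fullyInt=0) appear among them.
    """
    recent = deque(maxlen=window)
    events = []
    for number, line in enumerate(lines, start=1):
        if FATAL in line:
            context = list(recent)
            events.append({
                "line": number,
                "hijack_before": any(HIJACK in c for c in context),
                "stopped_not_fully_interruptible": any(
                    STOPPED in c and "fullyInt=0" in c for c in context
                ),
            })
        recent.append(line)
    return events


# The excerpt from this report, with its elided fields kept as-is.
SAMPLE = [
    "<tid> <ts> : SYNC   Stopped in Jitted code at pc=...  sp=...  fullyInt=0",
    "<tid> <ts> : SYNC   Hijacking return address ...  for thread ...",
    "<tid> <ts> : CORDB  D::HFE: About to call LogFatalError",
    "<tid> <ts> : EH     SetLastThrownObject: obj=... (ExecutionEngineException)",
    "<tid> <ts> : CORDB  D::RFFE: About to call RaiseFailFastException",
    "<tid> <ts> : SYNC   Thread::SuspendAllThreads() - Success",
]

for event in find_fatal_events(SAMPLE):
    print(event)
```

The look-back ignores thread ids, so in a real multi-threaded log the preceding hijack may belong to a different thread than the one that logged the fatal event; treat a match as a pointer to the window worth reading by hand, not as proof of causality.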

Minimal characterization of the invariant

During a routine GC, while the main UI thread is suspended inside the Win32 DispatchMessage(MSG) P/Invoke frame being root-scanned, the runtime hijacks the return address of another thread stopped at a fullyInt=0 JIT point and immediately enters Debugger::HandleFatalError → raises ExecutionEngineException → RaiseFailFastException.

The hijack-victim thread's work is incidental (layout in one crash, DateTimeOffset ctor in another) — the constant is the suspend/hijack step itself faulting, with the pump P/Invoke frame present in the root scan.

Disabling CET stops it — confirmed

  • Building the app with <CetCompat>false</CetCompat> (opts the exe out of CET shadow stacks; apphost PE CET_COMPAT bit verified flipped 1→0 via dumpbin /headers) eliminates the crash: the CET-off build ran 125+ minutes with zero EEE and was still healthy, versus an 8–80 minute crash baseline on every CET-on build (3 dumps, 2 binaries, GPU and software render).
  • This points at CET hardware shadow-stack enforcement as the trigger: a GC return-address hijack sets a thread's return address to a stub that is not on the shadow stack, which CET treats as a control-flow violation and terminates the process. The runtime has a CET-safe suspension path using special user-mode APCs (QueueUserAPC2); this looks like a case where the hijack path is taken while shadow stacks are active.
  • <CetCompat>false</CetCompat> is a usable mitigation, but it disables a security feature for the whole process; an in-runtime fix (always take the CET-safe suspension path when shadow stacks are enabled) would be preferable.
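
To make the suspected mechanism concrete, here is a toy model of a shadow stack. It is not runtime or OS code: ToyThread, ControlProtectionFault, and the address strings are hypothetical names for illustration. It only models the general CET rule that a call records the return address on both stacks and a return compares the two copies.

```python
class ControlProtectionFault(Exception):
    """Toy stand-in for the fault raised on a return-address mismatch."""


class ToyThread:
    def __init__(self, shadow_stack_enabled):
        self.stack = []    # ordinary stack: software can overwrite entries
        self.shadow = []   # shadow stack: written only by call/ret in this model
        self.shadow_stack_enabled = shadow_stack_enabled

    def call(self, return_address):
        # A call pushes the return address onto both stacks.
        self.stack.append(return_address)
        if self.shadow_stack_enabled:
            self.shadow.append(return_address)

    def hijack(self, stub_address):
        # A return-address hijack rewrites only the ordinary stack slot.
        self.stack[-1] = stub_address

    def ret(self):
        # A return pops both copies and faults if they disagree.
        target = self.stack.pop()
        if self.shadow_stack_enabled:
            expected = self.shadow.pop()
            if target != expected:
                raise ControlProtectionFault(
                    f"return to {target!r}, shadow stack expected {expected!r}"
                )
        return target


def hijack_then_return(shadow_stack_enabled):
    thread = ToyThread(shadow_stack_enabled)
    thread.call("caller+0x10")        # hypothetical original return address
    thread.hijack("suspension_stub")  # hypothetical hijack target
    return thread.ret()


print(hijack_then_return(shadow_stack_enabled=False))
try:
    hijack_then_return(shadow_stack_enabled=True)
except ControlProtectionFault as fault:
    print("fault:", fault)
```

With the shadow stack off, the hijacked return lands on the stub as intended; with it on, the same overwrite is detected at the return. That matches the observation that the CET-off build survives, but it does not by itself show which suspension path the runtime took in these crashes.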

Ruled out (with evidence)

  1. GPU / Skia / ANGLE render path — exonerated. Forced CPU software rendering (no av_libGLESv2.dll / av_libEGL.dll / nvwgf2umx.dll / d3d11.dll / dxgi.dll / opengl32.dll / vulkan-1.dll loaded — verified by module-list diff vs the GPU-mode dumps) still crashed with the identical mechanism in ~8 min, on a thread doing DateTimeOffset construction with zero render frames.
  2. Allocation pressure — exonerated. Fatal GCs are routine periodic gen-0/gen-1 at ~1/4 s with a ~35 MB heap; a low-allocation run crashed faster than a high-allocation one. No GC heap corruption (verifyheap clean).
  3. Application exception storm — exonerated. An earlier high-rate InvalidDataException storm (an app-side deserialization bug, unrelated to the runtime) was fixed; with it gone (InvalidDataException = 0 across a 273 MB StressLog) the EEE still reproduced. Background exception rate at crash is ~1/8 s, not a storm.

Environment

  • .NET: 10.0.8 (DAC 10.0.826.23019)
  • OS: Windows 11, build 26200 (x64)
  • UI framework: Avalonia 12, Win32 windowing backend, classic desktop lifetime, single-threaded WPF-style UI message pump
  • TieredCompilation already disabled; reproduces with and without tiering knobs.
  • App is mostly idle at crash time (periodic timers + UI message pump + routine GC).

Possibly related

Repro assets available on request

  • 3 full minidumps (Type=4) with armed StressLog (DOTNET_StressLog=1, 128 MB/thread ring); the two storm-free dumps are cleanest (~273 MB each, can be trimmed to the fatal window).
  • clrstack / clrthreads / threads / module-list captures per dump.

Questions for the runtime team

  1. Is this a known issue in the return-address hijack / EEPolicy::HandleFatalError path on Windows when a thread is suspended at a fullyInt=0 point while another thread's IL_STUB_PInvoke(MSG ByRef) frame is being root-scanned?
  2. CET: when shadow stacks are enabled, under what conditions can the GC take the hijack suspension path (return-address overwrite) instead of the CET-safe QueueUserAPC2 path? We've confirmed <CetCompat>false</CetCompat> stops the crash. Is that the recommended mitigation on 10.0.8, or is an in-runtime fix planned?
  3. Do any suspension-timing knobs change time-to-crash as a race corroborator: DOTNET_gcServer=1, DOTNET_GCgen0size=<larger>, DOTNET_TieredPGO=0, DOTNET_JITMinOpts=1? (DOTNET_ThreadSuspendInjection is Unix-only and does not apply on Windows.)
  4. Is there a recommended mitigation for desktop apps on 10.0.8 short of pinning an earlier runtime, other than opting out of CET?
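
If the knobs in question 3 are worth trying, the comparison could be run with a small harness like the sketch below. It launches a command once per knob set and records the exit code and time to exit. The KNOBS names, run_until_exit, and the soak timeout are assumptions for illustration; DOTNET_GCgen0size is left out because no concrete value is specified above.

```python
import os
import subprocess
import time

# Environment variables taken from question 3; the labels are hypothetical.
KNOBS = {
    "baseline": {},
    "server_gc": {"DOTNET_gcServer": "1"},
    "no_tiered_pgo": {"DOTNET_TieredPGO": "0"},
    "jit_min_opts": {"DOTNET_JITMinOpts": "1"},
}


def run_until_exit(command, extra_env, timeout_s):
    """Run `command` with extra environment variables until it exits.

    Returns (exit_code, elapsed_seconds). exit_code is None when the
    process was still alive at the timeout and had to be killed.
    """
    env = dict(os.environ, **extra_env)
    start = time.monotonic()
    try:
        completed = subprocess.run(command, env=env, timeout=timeout_s)
        exit_code = completed.returncode
    except subprocess.TimeoutExpired:
        exit_code = None
    return exit_code, time.monotonic() - start


def soak_matrix(command, timeout_s):
    """Run the command once per knob set and collect the results."""
    results = {}
    for label, extra_env in KNOBS.items():
        results[label] = run_until_exit(command, extra_env, timeout_s)
    return results
```

A call such as soak_matrix([path_to_app_exe], timeout_s=3 * 60 * 60), with path_to_app_exe standing in for the real executable, would give one sample per knob. Given the 8–80 minute spread on the baseline, a single run per knob says little, so several runs per configuration would be needed before reading anything into the differences.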

Suggested areas: area-VM-coreclr, area-GC-coreclr.
