Description
A long-running desktop app (Avalonia 12, Win32 windowing backend) reliably fail-fasts with a runtime-raised System.ExecutionEngineException (HRESULT 0x80131506, faulting module coreclr.dll) every ~8–80 minutes of idle/normal use. The exception is raised by the runtime itself from Debugger::HandleFatalError → RaiseFailFastException, so it bypasses AppDomain.UnhandledException and cannot be caught.
Using DOTNET_StressLog=1 plus a full minidump (DOTNET_DbgEnableMiniDump=1, DOTNET_DbgMiniDumpType=4), I captured the runtime's own fatal-event trail across three independent crashes. All three show the same mechanism: the GC's thread-suspension / return-address-hijack machinery raises the fatal error while suspending a thread stopped at a non-fully-interruptible JIT point (fullyInt=0), while the main UI thread is being GC-root-scanned inside the Win32 message pump's DispatchMessage(MSG) P/Invoke frame, during a routine (not pressure-driven) GC.
Disabling CET (<CetCompat>false</CetCompat>) stops the crash completely (confirmed by a 2+ hour soak — see below), which strongly implicates CET hardware shadow-stack enforcement colliding with the GC's return-address hijack.
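For anyone reproducing the capture, the settings named above can be applied per launch instead of machine-wide. This is a minimal Python sketch under that assumption: the helper names (capture_environment, run_with_capture) and the application path are hypothetical, and only the three DOTNET_ variables and their values come from this report.

```python
import os
import subprocess

# Settings and values as stated in this report.
CAPTURE_ENV = {
    "DOTNET_StressLog": "1",          # arm the in-memory StressLog
    "DOTNET_DbgEnableMiniDump": "1",  # write a dump when the process dies
    "DOTNET_DbgMiniDumpType": "4",    # 4 = full dump
}


def capture_environment(base=None):
    """Return a copy of `base` (default: os.environ) with the capture settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(CAPTURE_ENV)
    return env


def run_with_capture(app_path):
    """Launch the app (hypothetical path) with capture enabled and return its exit code."""
    return subprocess.run([app_path], env=capture_environment()).returncode
```

Calling run_with_capture with the path to the app's executable leaves the parent shell's environment untouched, so a CET-on and a CET-off build can be soaked side by side with identical capture settings.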
Frequency / Severity
- Reproduces every ~8–80 minutes of runtime. Fatal (process dies); no managed handler can intercept.
- Reproduced across two different builds of the app and two render modes (GPU/ANGLE and CPU/software) — 3 captured dumps total.
StressLog — the fatal sequence (representative; identical shape in all 3 dumps)
<tid> <ts> : SYNC Stopped in Jitted code at pc=... sp=... fullyInt=0
<tid> <ts> : SYNC Hijacking return address ... for thread ...
<tid> <ts> : CORDB D::HFE: About to call LogFatalError <-- FATAL
<tid> <ts> : EH SetLastThrownObject: obj=... (ExecutionEngineException)
<tid> <ts> : CORDB D::RFFE: About to call RaiseFailFastException <-- FAIL-FAST
<tid> <ts> : SYNC Thread::SuspendAllThreads() - Success
- Exactly one fatal event per crash. The thrown object is the System.ExecutionEngineException singleton (same MethodTable across all 3 crashes), _message=null, HResult=0x80131506, exception code 0xE0434352 → runtime-raised, not a managed throw.
- At every SuspendAllThreads (including the fatal one), the main UI thread shows Scanning Frameless method IL_STUB_PInvoke(MSG ByRef), i.e. it is parked in the Win32 GetMessage/DispatchMessage(MSG) P/Invoke and being root-scanned.
- The thread being hijacked at the fatal moment varies by crash and is doing ordinary work:
  - Crash A: main UI thread mid-Measure (layout).
  - Crash B: a non-main thread.
  - Crash C: a thread inside System.DateTimeOffset.ValidateOffset / ..ctor: pure CoreLib, no app/render frames (a periodic timer's timestamp path).
- The GC at the fatal point is routine: e.g. Crash C was BEGINGC ... requested generation = 1, an ordinary periodic gen-1, ~1 GC / 4 s, ~35 MB heap.
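To check a text StressLog dump for this shape, a small filter is enough. The sketch below assumes each entry follows the "<tid> <ts> : FACILITY message" layout of the excerpt above; real dump output may be formatted differently. The thread ids, timestamps, and addresses in SAMPLE are placeholders rather than values from the real dumps, and the function names are hypothetical.

```python
import re
from collections import deque

# Message fragments copied from the excerpt in this report.
FATAL_MARKER = "About to call LogFatalError"
HIJACK_MARKER = "Hijacking return address"

# Assumed entry layout: "<tid> <ts> : <FACILITY> <message>".
LINE_RE = re.compile(
    r"^\s*(?P<tid>\S+)\s+(?P<ts>\S+)\s+:\s+(?P<facility>\S+)\s+(?P<msg>.*)$"
)

# Placeholder entries shaped like the excerpt (values are made up).
SAMPLE = """\
1a2c 10.001 : SYNC Stopped in Jitted code at pc=0 sp=0 fullyInt=0
1a2c 10.002 : SYNC Hijacking return address 0 for thread 0
1a2c 10.003 : CORDB D::HFE: About to call LogFatalError
1a2c 10.004 : EH SetLastThrownObject: obj=0 (ExecutionEngineException)
1a2c 10.005 : CORDB D::RFFE: About to call RaiseFailFastException
1a2c 10.006 : SYNC Thread::SuspendAllThreads() - Success
"""


def parse_line(line):
    """Split one entry into tid / ts / facility / msg, or return None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


def fatal_windows(lines, context=5):
    """Yield (fatal_entry, preceding_entries) for every LogFatalError entry."""
    recent = deque(maxlen=context)
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if FATAL_MARKER in entry["msg"]:
            yield entry, list(recent)
        recent.append(entry)


def hijack_precedes_fatal(lines, context=5):
    """For each fatal entry, report whether a hijack entry appears in the preceding window."""
    return [
        any(HIJACK_MARKER in e["msg"] for e in before)
        for _, before in fatal_windows(lines, context)
    ]
```

Running hijack_precedes_fatal over the lines of each dump's log gives one boolean per fatal event, which is a quick way to confirm that a new dump has the same shape as the three described here.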
Minimal characterization of the invariant
During a routine GC, while the main UI thread is suspended inside the Win32 DispatchMessage(MSG) P/Invoke frame being root-scanned, the runtime hijacks the return address of another thread stopped at a fullyInt=0 JIT point and immediately enters Debugger::HandleFatalError → raises ExecutionEngineException → RaiseFailFastException.
The hijack-victim thread's work is incidental (layout in one crash, DateTimeOffset ctor in another) — the constant is the suspend/hijack step itself faulting, with the pump P/Invoke frame present in the root scan.
Disabling CET stops it — confirmed
- Building the app with <CetCompat>false</CetCompat> (opts the exe out of CET shadow stacks; apphost PE CET_COMPAT bit verified flipped 1→0 via dumpbin /headers) eliminates the crash: the CET-off build ran 125+ minutes with zero ExecutionEngineException (EEE) occurrences and was still healthy, versus an 8–80 minute crash baseline on every CET-on build (3 dumps, 2 binaries, GPU and software render).
- This points at CET hardware shadow-stack enforcement as the trigger: a GC return-address hijack sets a thread's return address to a stub that is not on the shadow stack, which CET treats as a control-flow violation and terminates the process. The runtime has a CET-safe suspension path using special user-mode APCs (QueueUserAPC2); this looks like a case where the hijack path is taken while shadow stacks are active.
- <CetCompat>false</CetCompat> is a usable mitigation but disables a security feature for the whole process; an in-runtime fix (always take the CET-safe suspension path when shadow stacks are enabled) would be preferable.
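The dumpbin check of the apphost can also be scripted. This is a minimal Python sketch, assuming the flag is stored the way the PE format describes it: a debug-directory entry of type IMAGE_DEBUG_TYPE_EX_DLLCHARACTERISTICS (20) whose 4-byte payload has bit 0x0001 (IMAGE_DLLCHARACTERISTICS_EX_CET_COMPAT) set. The function name is_cet_compat is hypothetical, and the parser skips most validation.

```python
import struct

IMAGE_DIRECTORY_ENTRY_DEBUG = 6
IMAGE_DEBUG_TYPE_EX_DLLCHARACTERISTICS = 20
IMAGE_DLLCHARACTERISTICS_EX_CET_COMPAT = 0x0001


def is_cet_compat(data: bytes) -> bool:
    """Return True if the PE image bytes carry the CET_COMPAT extended flag."""
    (pe_off,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew
    if data[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("not a PE image")
    coff = pe_off + 4
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    (opt_size,) = struct.unpack_from("<H", data, coff + 16)
    opt = coff + 20
    (magic,) = struct.unpack_from("<H", data, opt)
    # Data directories start at offset 112 for PE32+ (0x20B) and 96 for PE32.
    dirs = opt + (112 if magic == 0x20B else 96)
    (num_dirs,) = struct.unpack_from("<I", data, dirs - 4)
    if num_dirs <= IMAGE_DIRECTORY_ENTRY_DEBUG:
        return False
    dbg_rva, dbg_size = struct.unpack_from(
        "<II", data, dirs + 8 * IMAGE_DIRECTORY_ENTRY_DEBUG
    )
    if dbg_rva == 0:
        return False

    # Translate the debug directory RVA to a file offset via the section table.
    sections = opt + opt_size
    dbg_off = None
    for i in range(num_sections):
        vsize, vaddr, rsize, rptr = struct.unpack_from(
            "<IIII", data, sections + 40 * i + 8
        )
        if vaddr <= dbg_rva < vaddr + max(vsize, rsize):
            dbg_off = dbg_rva - vaddr + rptr
            break
    if dbg_off is None:
        return False

    # Each debug directory entry is 28 bytes; Type is at +12, file pointer at +24.
    for i in range(dbg_size // 28):
        dtype, size, _rva, ptr = struct.unpack_from(
            "<IIII", data, dbg_off + 28 * i + 12
        )
        if dtype == IMAGE_DEBUG_TYPE_EX_DLLCHARACTERISTICS and size >= 4:
            (flags,) = struct.unpack_from("<I", data, ptr)
            return bool(flags & IMAGE_DLLCHARACTERISTICS_EX_CET_COMPAT)
    return False
```

Reading the apphost executable in binary mode and passing its bytes to is_cet_compat should agree with what dumpbin /headers reports: True for the CET-on builds and False for the <CetCompat>false</CetCompat> build.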
Ruled out (with evidence)
- GPU / Skia / ANGLE render path: exonerated. Forced CPU software rendering (no av_libGLESv2.dll / av_libEGL.dll / nvwgf2umx.dll / d3d11.dll / dxgi.dll / opengl32.dll / vulkan-1.dll loaded, verified by module-list diff vs the GPU-mode dumps) still crashed with the identical mechanism in ~8 min, on a thread doing DateTimeOffset construction with zero render frames.
- Allocation pressure: exonerated. Fatal GCs are routine periodic gen-0/gen-1 collections, about one every 4 s, with a ~35 MB heap; a low-allocation run crashed faster than a high-allocation one. No GC heap corruption (verifyheap clean).
- Application exception storm: exonerated. An earlier high-rate InvalidDataException storm (an app-side deserialization bug, unrelated to the runtime) was fixed; with it gone (InvalidDataException = 0 across a 273 MB StressLog) the EEE still reproduced. Background exception rate at crash is about one every 8 s, not a storm.
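The module-list check in the first bullet amounts to a set comparison. This is a minimal Python sketch, assuming each dump's module list has been exported as one path or file name per entry; the function names and the example paths in the usage note are hypothetical, while the DLL names are the ones listed above.

```python
import ntpath

# Render-path modules named in this report, lower-cased for comparison.
GPU_RENDER_MODULES = {
    "av_libglesv2.dll",
    "av_libegl.dll",
    "nvwgf2umx.dll",
    "d3d11.dll",
    "dxgi.dll",
    "opengl32.dll",
    "vulkan-1.dll",
}


def normalize(module_list):
    """Reduce Windows paths to lower-cased file names, ignoring blank entries."""
    return {ntpath.basename(m.strip()).lower() for m in module_list if m.strip()}


def render_modules_present(module_list):
    """Return the GPU render modules that appear in a dump's module list."""
    return sorted(normalize(module_list) & GPU_RENDER_MODULES)


def module_diff(gpu_dump_modules, sw_dump_modules):
    """Return modules unique to each dump, keyed by which dump they came from."""
    gpu, sw = normalize(gpu_dump_modules), normalize(sw_dump_modules)
    return {"only_in_gpu": sorted(gpu - sw), "only_in_sw": sorted(sw - gpu)}
```

An empty result from render_modules_present on the software-render dump, together with those names showing up under only_in_gpu, is the condition the bullet describes.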
Environment
- .NET: 10.0.8 (DAC 10.0.826.23019)
- OS: Windows 11, build 26200 (x64)
- UI framework: Avalonia 12, Win32 windowing backend, classic desktop lifetime, single-threaded WPF-style UI message pump
- TieredCompilation already disabled; reproduces with and without tiering knobs.
- App is mostly idle at crash time (periodic timers + UI message pump + routine GC).
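Possibly related
- 0x80131506 / coreclr.dll / 0xc0000005, 1–10 min after launch, worse with more RAM, escalated after Windows 11 24H2. Open, tracking-external-issue, no working workaround. Closest match; the 24H2 correlation implicates the OS CET/shadow-stack/APC change.
- "!AreCetShadowStacksEnabled() || UseSpecialUserModeApc() assertion when embedding CoreCLR" #83437: the runtime assertion !AreCetShadowStacksEnabled() || UseSpecialUserModeApc() (encodes "shadow stacks on ⇒ must be on the special-APC suspension path").
- QueueUserAPC2 faults on older CPUs (the CET-safe APC path's own failure mode).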
Repro assets available on request
- 3 full minidumps (Type=4) with armed StressLog (DOTNET_StressLog=1, 128 MB/thread ring); the two storm-free dumps are cleanest (~273 MB each, can be trimmed to the fatal window).
- clrstack / clrthreads / threads / module-list captures per dump.
Questions for the runtime team
- Is this a known issue in the return-address hijack / EEPolicy::HandleFatalError path on Windows when a thread is suspended at a fullyInt=0 point while another thread's IL_STUB_PInvoke(MSG ByRef) frame is being root-scanned?
- CET: when shadow stacks are enabled, under what conditions can the GC take the hijack suspension path (return-address overwrite) instead of the CET-safe QueueUserAPC2 path? We've confirmed <CetCompat>false</CetCompat> stops the crash; is that the recommended mitigation on 10.0.8, or is an in-runtime fix planned?
- Do any suspension-timing knobs change time-to-crash as a race corroborator: DOTNET_gcServer=1, DOTNET_GCgen0size=<larger>, DOTNET_TieredPGO=0, DOTNET_JITMinOpts=1? (DOTNET_ThreadSuspendInjection is Unix-only and does not apply on Windows.)
- Is there a recommended mitigation for desktop apps on 10.0.8 short of pinning an earlier runtime, other than opting out of CET?
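If the knob comparison in the third question is worth running, a small harness could measure time-to-exit per setting. This is a minimal Python sketch: the variable names come from the question above, but the DOTNET_GCgen0size value is a placeholder (the report only says "<larger>", and the value is assumed to be parsed as hex), and the function names and the app command are hypothetical.

```python
import os
import subprocess
import time

KNOB_SETS = {
    "baseline": {},
    "server_gc": {"DOTNET_gcServer": "1"},
    # Placeholder value: 0x4000000 = 64 MiB if the runtime reads it as hex.
    "bigger_gen0": {"DOTNET_GCgen0size": "4000000"},
    "tiered_pgo_off": {"DOTNET_TieredPGO": "0"},
    "jit_minopts": {"DOTNET_JITMinOpts": "1"},
}


def time_to_exit(cmd, extra_env, timeout_s):
    """Run cmd with extra_env; return (exit_code, seconds), or (None, timeout_s) if it survived."""
    env = dict(os.environ)
    env.update(extra_env)
    start = time.monotonic()
    proc = subprocess.Popen(cmd, env=env)
    try:
        code = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Still alive at the deadline: stop it and report survival.
        proc.kill()
        proc.wait()
        return None, timeout_s
    return code, time.monotonic() - start


def run_matrix(cmd, timeout_s):
    """Run cmd once per knob set and collect (exit_code, seconds) per name."""
    return {name: time_to_exit(cmd, env, timeout_s) for name, env in KNOB_SETS.items()}
```

With an 8–80 minute baseline, each configuration would need a timeout well beyond 80 minutes and several repetitions before a difference in time-to-crash means anything.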
Suggested areas: area-VM-coreclr, area-GC-coreclr.