Core dump experimentation#3370

Draft
petejohanson wants to merge 4 commits into
zmkfirmware:mainfrom
petejohanson:core-dump-experimentation
@petejohanson

@petejohanson petejohanson commented May 30, 2026

Contributor

Creating this as a draft for now. I'll decide later whether these new board variants should be merged; the immediate goal is to enable postmortem analysis of faults in ZMK, like the one in #3195 and the central issues reported since the Zephyr upgrade. If you've been hitting that, I implore you to try these steps to help us track down that lingering issue.

Testing Steps

Edit: Added entry to change GHA to include the .elf file in the build zip

  • If building with GHA, update your repo's .github/workflows/build.yml to point to petejohanson/zmk/.github/workflows/build-user-config.yml@core-dump-experimentation
  • Build your usual firmware, but with the zmk_debug variant
  • After a fault/freeze, reset the device into the Adafruit bootloader.
  • Grab the full UF2 contents from the device by copying CURRENT.UF2 from the mass storage device to your host.
  • Attach it here, with a link to the exact GitHub workflow run that generated your firmware. If you built locally, also attach the zmk.elf file from the build you flashed.

Debugging Steps

  1. Convert the UF2 file to a .bin file: python ../zephyr/scripts/build/uf2conv.py ~/Downloads/xiao-tester-crash.uf2 --output ~/Documents/xiao-tester-crash.bin
  2. Extract the actual coredump from the full flash dump, using the attached script (which I will be upstreaming to the Zephyr project once others validate it): python ../zephyr/scripts/coredump/coredump_flash_dump_extractor.py ~/Documents/xiao-tester-crash.bin 0x70000 ~/Documents/xiao_tester_dump.bin (The 0x70000 offset corresponds to the coredump partition offset, but with a bootloader-related 0x1000 offset applied.)
  3. Run the debug server for the dump: python ../zephyr/scripts/coredump/coredump_gdbserver.py build/xiao_tester_debug/zephyr/zmk.elf ~/Documents/xiao_tester_dump.bin
  4. In a separate process, actually run GDB: gdb zmk.elf then in the session, run target remote localhost:1234

The later steps are a reproduction of the docs from https://docs.zephyrproject.org/4.1.0/services/debugging/coredump.html
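
For anyone curious what step 1 actually does to the file, here is a minimal sketch of flattening a UF2 file into a raw image. It is not the code of uf2conv.py; the function name `uf2_to_bin` and the 0xFF gap fill are my own choices for illustration. The 512-byte block layout (magic numbers, target address, payload size) follows the public UF2 format description.

```python
import struct

UF2_MAGIC_START0 = 0x0A324655  # "UF2\n"
UF2_MAGIC_START1 = 0x9E5D5157
UF2_MAGIC_END = 0x0AB16F30
UF2_FLAG_NOT_MAIN_FLASH = 0x00000001
BLOCK_SIZE = 512
HEADER_SIZE = 32


def uf2_to_bin(uf2: bytes, fill: int = 0xFF):
    """Return (base_address, flat_image) built from the blocks of a UF2 file.

    Hypothetical helper: gaps between blocks are filled with `fill`
    (erased-flash value), which is an illustrative choice.
    """
    chunks = {}
    for off in range(0, len(uf2) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = uf2[off:off + BLOCK_SIZE]
        magic0, magic1, flags, addr, size, _no, _total, _family = struct.unpack_from(
            "<8I", block, 0
        )
        (magic_end,) = struct.unpack_from("<I", block, BLOCK_SIZE - 4)
        if (magic0, magic1, magic_end) != (
            UF2_MAGIC_START0,
            UF2_MAGIC_START1,
            UF2_MAGIC_END,
        ):
            continue  # not a UF2 block, skip it
        if flags & UF2_FLAG_NOT_MAIN_FLASH:
            continue  # block is not meant for main flash
        chunks[addr] = block[HEADER_SIZE:HEADER_SIZE + size]

    if not chunks:
        raise ValueError("no valid UF2 blocks found")

    base = min(chunks)
    end = max(addr + len(data) for addr, data in chunks.items())
    image = bytearray([fill]) * (end - base)
    for addr, data in chunks.items():
        image[addr - base:addr - base + len(data)] = data
    return base, bytes(image)
```

The returned base address matters for step 2: an offset into the flat image is relative to the first block's target address, not to address zero of the flash.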

coredump_flash_dump_extractor.py

PR check-list

  • Branch has a clean commit history
  • Additional tests are included, if changing behaviors/core code that is testable.
  • Proper Copyright + License headers added to applicable files (Generally, we stick to "The ZMK Contributors" for copyrights to help avoid churn when files get edited)
  • Pre-commit used to check formatting of files, commit messages, etc.
  • Includes any necessary documentation changes.

Add a new board variant for the XIAO nRF52840 that properly enables
coredumps to a dedicated flash partition, for postmortem analysis of
hard to reproduce crashes.
Add a new board variant for the nice!nano that properly enables
coredumps to a dedicated flash partition, for postmortem analysis of
hard to reproduce crashes.
@petejohanson petejohanson self-assigned this May 30, 2026
@petejohanson petejohanson added the board (PRs and issues related to boards) and core (Core functionality/behavior of ZMK) labels May 30, 2026
…pace

Add an additional variant that allocates more space for the ZMK firmware
before the coredump partition.
@SethMilliken

SethMilliken commented May 30, 2026

Contributor

2026-05-30-14-11-araxia-dump.zip
Using firmware from https://github.com/SethMilliken/zmk-config/actions/runs/26693278058

Edit: It's possible I may have used the incorrect offset, so this dump may not be usable. (I had used hexdump to find a line that started with the expected header id, but there were multiple ostensible matches and the one I selected may not match up with the rest of the header fields).

Added this log line to the extractor script after the unpack:

print(f'id1: {id1} id2: {id2} hdr_ver: {hdr_ver} size: {size} flags: {flags} checksum: {checksum} err: {err}')

And got:

id1: b'C' id2: b'D' hdr_ver: 1 size: 65504 flags: 0 checksum: 8485 err: 0

This is the zmk_debug_extra variant, using offset 0x98000.
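
When hexdump shows several candidate `CD` matches, one way to narrow them down is to unpack every candidate and keep only those whose other fields are plausible. The sketch below does that. The field names come from the log line above; the packed 16-byte little-endian layout (`<ccHIHHi`) is an assumption about how the header is stored on a 32-bit target, not something confirmed by the attached extractor script, and `find_coredump_headers` is a hypothetical helper name. The checksum is reported but not verified.

```python
import struct

# Assumed layout: id1, id2, hdr_ver, size, flags, checksum, err
# (packed, little-endian, 32-bit size field). This is an assumption.
HDR = struct.Struct("<ccHIHHi")


def find_coredump_headers(image: bytes, expected_version: int = 1):
    """Return [(offset, size, flags, checksum, err)] for plausible headers.

    A candidate is kept only if the header version matches and the dump
    it describes fits inside the image.
    """
    hits = []
    pos = image.find(b"CD")
    while pos != -1:
        if pos + HDR.size <= len(image):
            _id1, _id2, hdr_ver, size, flags, checksum, err = HDR.unpack_from(image, pos)
            payload_end = pos + HDR.size + size
            if hdr_ver == expected_version and size > 0 and payload_end <= len(image):
                hits.append((pos, size, flags, checksum, err))
        pos = image.find(b"CD", pos + 1)
    return hits
```

Calling `find_coredump_headers(open("dump.bin", "rb").read())` on a flat flash image (the file name is a placeholder) lists each surviving offset, which can then be compared with the offset passed to the extractor.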

@caksoylar

Contributor

I managed to reproduce on my Xiao split. The zip has the firmware ELF, the extracted UF2, and the extracted dump bins: xiao-tester-crash.zip

Managed to get a debugger this way, by using the arm gdb from Zephyr SDK:

# my west is pipx-installed, use python from its venv
$ ~/.local/pipx/venvs/west/bin/python ../zephyr/scripts/coredump/coredump_gdbserver.py zmk.elf xiao_tester_dump.bin &

# use arm gdb
$ ~/zephyr-sdk-0.16.3/arm-zephyr-eabi/bin/arm-zephyr-eabi-gdb zmk.elf -ex "target remote localhost:1234"
GNU gdb (Zephyr SDK 0.16.3) 12.1
[...]
Reading symbols from build/rommana_left-rgbled_adapter-debug/zephyr/zmk.elf...
Remote debugging using localhost:1234
lll_prepare_resolve (is_abort_cb=0x0, abort_cb=0x0, prepare_cb=0x0, prepare_cb@entry=0x2a2b9 <prepare_cb>,
    prepare_param=0x0, prepare_param@entry=0x20007ab8 <buffer_mem_tx+932>, is_resume=is_resume@entry=0 '\000',
    is_dequeue=is_dequeue@entry=0 '\000')
    at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:894
894                     if (is_resume) {
(gdb) bt
#0  lll_prepare_resolve (is_abort_cb=0x0, abort_cb=0x0, prepare_cb=0x0, prepare_cb@entry=0x2a2b9 <prepare_cb>, prepare_param=0x0, prepare_param@entry=0x20007ab8 <buffer_mem_tx+932>, is_resume=is_resume@entry=0 '\000',
    is_dequeue=is_dequeue@entry=0 '\000') at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:894
#1  0x0003d608 in lll_prepare (is_abort_cb=<optimized out>, abort_cb=<optimized out>, prepare_cb=prepare_cb@entry=0x2a2b9 <prepare_cb>, event_prio=event_prio@entry=0 '\000',
    prepare_param=prepare_param@entry=0x20007ab8 <buffer_mem_tx+932>) at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/lll_common.c:65
#2  0x0004a314 in lll_periph_prepare (param=0x20007ab8 <buffer_mem_tx+932>) at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_peripheral.c:83
#3  0x0003a972 in mayfly_run (callee_id=<optimized out>) at /home/cem/zmk/zephyr/subsys/bluetooth/controller/util/mayfly.c:217
#4  0x000318f6 in _isr_wrapper () at /home/cem/zmk/zephyr/arch/arm/core/cortex_m/isr_wrapper.c:77
#5  <signal handler called>
#6  0x0004b078 in mayfly_pend (caller_id=caller_id@entry=1 '\001', callee_id=callee_id@entry=0 '\000') at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/nordic/hal/nrf5/mayfly.c:126
#7  0x0003a824 in mayfly_enqueue (caller_id=caller_id@entry=1 '\001', callee_id=callee_id@entry=0 '\000', chain=1 '\001', chain@entry=0 '\000', m=0x20000bfc <mfy+12>, m@entry=0x20000c68 <mfy>)
    at /home/cem/zmk/zephyr/subsys/bluetooth/controller/util/mayfly.c:125
#8  0x00046456 in ull_periph_ticker_cb (ticks_drift=<optimized out>, param=0x20003ba8 <prio_recv_thread_data+224>, force=0 '\000', lazy=0, remainder=12810048, ticks_at_expire=2328441)
    at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/ull_peripheral.c:586
#9  ull_periph_ticker_cb (ticks_at_expire=2328441, ticks_drift=<optimized out>, remainder=12810048, lazy=<optimized out>, force=0 '\000', param=0x20003ba8 <prio_recv_thread_data+224>)
    at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/ull_peripheral.c:522
#10 0x0003b01c in ticker_worker (param=<optimized out>) at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ticker/ticker.c:1472
#11 0x0003a972 in mayfly_run (callee_id=callee_id@entry=1 '\001') at /home/cem/zmk/zephyr/subsys/bluetooth/controller/util/mayfly.c:217
#12 0x00047c60 in rtc0_nrf5_isr (arg=<optimized out>) at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:163
#13 0x000318f6 in _isr_wrapper () at /home/cem/zmk/zephyr/arch/arm/core/cortex_m/isr_wrapper.c:77
warning: no PSP thread stack unwinding supported.
#14 <signal handler called>
warning: no PSP thread stack unwinding supported.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info registers
r0             0x3                 3
r1             0x0                 0
r2             0x21                33
r3             0x3                 3
r4             0x0                 0
r5             0x0                 0
r6             0x0                 0
r7             0x0                 0
r8             0x0                 0
r9             0x0                 0
r10            0x0                 0
r11            0x0                 0
r12            0x20000878          536873080
sp             0x2000d2b0          0x2000d2b0 <z_interrupt_stacks+1584>
lr             0x487e3             296931
pc             0x487f2             0x487f2 <lll_prepare_resolve+446>
xpsr           0x21000028          553648168
(gdb)
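
The interactive session above can also be captured non-interactively, which makes it easier to attach the same output to each report. A minimal sketch, assuming the coredump gdbserver is already listening on port 1234 as in the steps: `gdb_batch_command` is a hypothetical helper that only builds the argument list, using gdb's standard `-batch` and `-ex` options. The gdb binary name and ELF path are placeholders.

```python
import shlex


def gdb_batch_command(gdb, elf, port=1234, commands=("bt", "info registers")):
    """Build an argv list that connects to the gdbserver, runs the given
    gdb commands once, and exits (batch mode)."""
    argv = [gdb, "-batch", elf, "-ex", f"target remote localhost:{port}"]
    for command in commands:
        argv += ["-ex", command]
    return argv


def as_shell_line(argv):
    """Render the argv list as a copy-pasteable shell command."""
    return shlex.join(argv)
```

Passing the list to `subprocess.run(argv, capture_output=True, text=True)` would collect the backtrace and register dump as text for pasting into a comment.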

To help with debugging, include the zmk.elf file in the artifacts for a
user build.
@petejohanson

Contributor Author

> 2026-05-30-14-11-araxia-dump.zip
> Using firmware from https://github.com/SethMilliken/zmk-config/actions/runs/26693278058
>
> Edit: It's possible I may have used the incorrect offset, so this dump may not be usable. (I had used hexdump to find a line that started with the expected header id, but there were multiple ostensible matches and the one I selected may not match up with the rest of the header fields).
>
> Added this log line to the extractor script after the unpack:
>
> print(f'id1: {id1} id2: {id2} hdr_ver: {hdr_ver} size: {size} flags: {flags} checksum: {checksum} err: {err}')
>
> And got:
>
> id1: b'C' id2: b'D' hdr_ver: 1 size: 65504 flags: 0 checksum: 8485 err: 0
>
> This is the zmk_debug_extra variant, using offset 0x98000.

Thanks! I updated the description to include steps to update your workflow to archive the .elf file we need to debug. Can you please re-run and reproduce with a build where I can snag the .elf file? Thanks!

@SethMilliken

Contributor

> Thanks! I updated the description to include steps to update your workflow to archive the .elf file we need to debug. Can you please re-run and reproduce with a build where I can snag the .elf file?

Built and installed. Now waiting impatiently for my keyboard to lock-up.

@SethMilliken

Contributor

Got gdb working this time. This one looks like maybe a null pointer or some other memory issue in lll.c. backtrace shows this line, with the earlier trace showing is_resume=<error reading variable: Cannot access memory at address 0x20011740>.

#0  lll_prepare_resolve (is_abort_cb=0x0, abort_cb=0x0, prepare_cb=0x0, prepare_param=0x0,
    is_resume=<error reading variable: Cannot access memory at address 0x20011740>,
    is_dequeue=<error reading variable: Cannot access memory at address 0x20011744>)
    at /__w/zmk-config/zmk-config/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:894
894     in /__w/zmk-config/zmk-config/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c

LMK if there are any gdb invocations that would provide additional useful output.

@petejohanson

Contributor Author

> Got gdb working this time. This one looks like maybe a null pointer or some other memory issue in lll.c. backtrace shows this line, with the earlier trace showing is_resume=<error reading variable: Cannot access memory at address 0x20011740>.
>
> #0  lll_prepare_resolve (is_abort_cb=0x0, abort_cb=0x0, prepare_cb=0x0, prepare_param=0x0,
>     is_resume=<error reading variable: Cannot access memory at address 0x20011740>,
>     is_dequeue=<error reading variable: Cannot access memory at address 0x20011744>)
>     at /__w/zmk-config/zmk-config/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:894
> 894     in /__w/zmk-config/zmk-config/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c
>
> LMK if there are any gdb invocations that would provide additional useful output.
>
> * [2026-05-31_13-09-06-araxia-dump.zip](https://github.com/user-attachments/files/28444605/2026-05-31_13-09-06-araxia-dump.zip)
>
> * [firmware build here](https://github.com/SethMilliken/zmk-config/actions/runs/26701895641)

The good news: This is the exact same fault point that @caksoylar reported, so we can at least be confident you're both experiencing the same bug. I will review both and see what else might be useful.

@petejohanson

Contributor Author

For everyone who has hit the issue, can you please give my core-dump-experimentation-with-lll-cherry-picks branch a try? It pulls in a version of Zephyr with various upstream controller fixes cherry picked in. Thanks!

@SethMilliken

Contributor

I tried the lll cherry picks branch, and my boards can't even get past connection now before locking up. I've had to revert to be able to type at all.

It did generate core dumps though, and here is a backtrace:

lll_conn_peripheral_is_abort_cb (next=<optimized out>, curr=<optimized out>, resume_cb=<optimized out>) at /__w/zmk-config/zmk-config/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_conn.c:228
warning: 228	/__w/zmk-config/zmk-config/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_conn.c: No such file or directory

Attached uf2 for use with zmk.elf from https://github.com/SethMilliken/zmk-config/actions/runs/27086366160

2026-06-07-araxia-lll-cp-extra-nicehat.tgz

@petejohanson

Contributor Author

@SethMilliken Can you try again? I've updated it to use a Zephyr branch with a small set of selective commits suggested by @cvinayak

@SethMilliken

Contributor

Firmware with selective cherry picks now installed on my two main boards. I'll be doing a lot of typing tonight. Fingers crossed (but only for activating combos).

@SethMilliken

Contributor

Two lockups already with the new selective cherry picks firmware, both with the same backtrace:

lll_prepare_resolve (is_abort_cb=0x0, abort_cb=0x0, prepare_cb=0x0, prepare_param=0x0, is_resume=<error reading variable: Cannot access memory at address 0x20011740>, is_dequeue=<error reading variable: Cannot access memory at address 0x20011744>) at /__w/zmk-config/zmk-config/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:924
warning: 924	/__w/zmk-config/zmk-config/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c: No such file or directory

Which looks like the same callsite as before:

>       if (is_resume) {
                return -EINPROGRESS;
        }

LMK if you want the UF2.

@cvinayak

cvinayak commented Jun 8, 2026


> @SethMilliken Can you try again? I've updated it to use a Zephyr branch with a small set of selective commits suggested by @cvinayak

In addition to the 3 commits, please also include the attached quick-check patch.
0001-quick-check.patch

4 participants