Core dump experimentation #3370
Add a new board variant for the XIAO nRF52840 that properly enables coredumps to a dedicated flash partition, for postmortem analysis of hard-to-reproduce crashes.
Add a new board variant for the nice!nano that properly enables coredumps to a dedicated flash partition, for postmortem analysis of hard-to-reproduce crashes.
…pace Add an additional variant that allocates more space for the ZMK firmware before the coredump partition.
|
2026-05-30-14-11-araxia-dump.zip
Edit: It's possible I may have used the incorrect offset, so this dump may not be usable. (I had used Added this log line to the extractor script after the unpack: print(f'id1: {id1} id2: {id2} hdr_ver: {hdr_ver} size: {size} flags: {flags} checksum: {checksum} err: {err}') And got: This is the |
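A quick way to sanity-check a candidate offset before running the full extractor is to unpack the header at that offset and look at the fields. The sketch below is only an illustration: the field names come from the log line above, but the field widths (`<ccHHHHi`) and the `C`/`D` ID bytes are assumptions about the flash backend header that this thread does not confirm, and `describe_header` and `looks_like_coredump` are hypothetical helper names.

```python
import struct

# ASSUMPTION (not confirmed in this thread): two ID bytes, four little-endian
# uint16 fields, then a signed 32-bit error code, with no padding.
HEADER_FORMAT = "<ccHHHHi"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def describe_header(blob: bytes, offset: int) -> str:
    """Unpack the assumed header at `offset` and format it like the log line."""
    if offset < 0 or offset + HEADER_SIZE > len(blob):
        raise ValueError(f"no room for a {HEADER_SIZE}-byte header at {offset:#x}")
    id1, id2, hdr_ver, size, flags, checksum, err = struct.unpack_from(
        HEADER_FORMAT, blob, offset
    )
    return (
        f"id1: {id1} id2: {id2} hdr_ver: {hdr_ver} size: {size} "
        f"flags: {flags} checksum: {checksum} err: {err}"
    )


def looks_like_coredump(blob: bytes, offset: int) -> bool:
    """Heuristic only: ASSUMES a valid header starts with the bytes 'C', 'D'."""
    return blob[offset:offset + 2] == b"CD"
```

If the ID bytes at the chosen offset do not match, the offset is a likely suspect, although a mismatch could also mean the assumed layout is wrong for the Zephyr version in use.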
|
I managed to reproduce on my Xiao split. The zip has the firmware elf, the extracted uf2, and the extracted dump bins: xiao-tester-crash.zip
Managed to get a debugger this way, by using the arm gdb from Zephyr SDK:
# my west is pipx-installed, use python from its venv
$ ~/.local/pipx/venvs/west/bin/python ../zephyr/scripts/coredump/coredump_gdbserver.py zmk.elf xiao_tester_dump.bin &
# use arm gdb
$ ~/zephyr-sdk-0.16.3/arm-zephyr-eabi/bin/arm-zephyr-eabi-gdb zmk.elf -ex "target remote localhost:1234"
GNU gdb (Zephyr SDK 0.16.3) 12.1
[...]
Reading symbols from build/rommana_left-rgbled_adapter-debug/zephyr/zmk.elf...
Remote debugging using localhost:1234
lll_prepare_resolve (is_abort_cb=0x0, abort_cb=0x0, prepare_cb=0x0, prepare_cb@entry=0x2a2b9 <prepare_cb>,
prepare_param=0x0, prepare_param@entry=0x20007ab8 <buffer_mem_tx+932>, is_resume=is_resume@entry=0 '\000',
is_dequeue=is_dequeue@entry=0 '\000')
at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:894
894 if (is_resume) {
(gdb) bt
#0 lll_prepare_resolve (is_abort_cb=0x0, abort_cb=0x0, prepare_cb=0x0, prepare_cb@entry=0x2a2b9 <prepare_cb>, prepare_param=0x0, prepare_param@entry=0x20007ab8 <buffer_mem_tx+932>, is_resume=is_resume@entry=0 '\000',
is_dequeue=is_dequeue@entry=0 '\000') at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:894
#1 0x0003d608 in lll_prepare (is_abort_cb=<optimized out>, abort_cb=<optimized out>, prepare_cb=prepare_cb@entry=0x2a2b9 <prepare_cb>, event_prio=event_prio@entry=0 '\000',
prepare_param=prepare_param@entry=0x20007ab8 <buffer_mem_tx+932>) at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/lll_common.c:65
#2 0x0004a314 in lll_periph_prepare (param=0x20007ab8 <buffer_mem_tx+932>) at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_peripheral.c:83
#3 0x0003a972 in mayfly_run (callee_id=<optimized out>) at /home/cem/zmk/zephyr/subsys/bluetooth/controller/util/mayfly.c:217
#4 0x000318f6 in _isr_wrapper () at /home/cem/zmk/zephyr/arch/arm/core/cortex_m/isr_wrapper.c:77
#5 <signal handler called>
#6 0x0004b078 in mayfly_pend (caller_id=caller_id@entry=1 '\001', callee_id=callee_id@entry=0 '\000') at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/nordic/hal/nrf5/mayfly.c:126
#7 0x0003a824 in mayfly_enqueue (caller_id=caller_id@entry=1 '\001', callee_id=callee_id@entry=0 '\000', chain=1 '\001', chain@entry=0 '\000', m=0x20000bfc <mfy+12>, m@entry=0x20000c68 <mfy>)
at /home/cem/zmk/zephyr/subsys/bluetooth/controller/util/mayfly.c:125
#8 0x00046456 in ull_periph_ticker_cb (ticks_drift=<optimized out>, param=0x20003ba8 <prio_recv_thread_data+224>, force=0 '\000', lazy=0, remainder=12810048, ticks_at_expire=2328441)
at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/ull_peripheral.c:586
#9 ull_periph_ticker_cb (ticks_at_expire=2328441, ticks_drift=<optimized out>, remainder=12810048, lazy=<optimized out>, force=0 '\000', param=0x20003ba8 <prio_recv_thread_data+224>)
at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/ull_peripheral.c:522
#10 0x0003b01c in ticker_worker (param=<optimized out>) at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ticker/ticker.c:1472
#11 0x0003a972 in mayfly_run (callee_id=callee_id@entry=1 '\001') at /home/cem/zmk/zephyr/subsys/bluetooth/controller/util/mayfly.c:217
#12 0x00047c60 in rtc0_nrf5_isr (arg=<optimized out>) at /home/cem/zmk/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:163
#13 0x000318f6 in _isr_wrapper () at /home/cem/zmk/zephyr/arch/arm/core/cortex_m/isr_wrapper.c:77
warning: no PSP thread stack unwinding supported.
#14 <signal handler called>
warning: no PSP thread stack unwinding supported.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info registers
r0 0x3 3
r1 0x0 0
r2 0x21 33
r3 0x3 3
r4 0x0 0
r5 0x0 0
r6 0x0 0
r7 0x0 0
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x20000878 536873080
sp 0x2000d2b0 0x2000d2b0 <z_interrupt_stacks+1584>
lr 0x487e3 296931
pc 0x487f2 0x487f2 <lll_prepare_resolve+446>
xpsr 0x21000028 553648168
(gdb) |
To help with debugging, include the zmk.elf file in the artifacts for a user build.
Thanks! I updated the description to include steps to update your workflow to archive the .elf file we need to debug. Can you please re-run and reproduce with a build where I can snag the .elf file? Thanks! |
Built and installed. Now waiting impatiently for my keyboard to lock-up. |
|
Got #0 lll_prepare_resolve (is_abort_cb=0x0, abort_cb=0x0, prepare_cb=0x0, prepare_param=0x0,
is_resume=<error reading variable: Cannot access memory at address 0x20011740>,
is_dequeue=<error reading variable: Cannot access memory at address 0x20011744>)
at /__w/zmk-config/zmk-config/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:894
894 in /__w/zmk-config/zmk-config/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c
LMK if there are any |
The good news: This is the exact same fault point that @caksoylar reported, so we can at least be confident you're both experiencing the same bug. I will review both and see what else might be useful. |
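Comparing fault points by eye gets tedious once several people post backtraces, especially since the source paths differ per build machine. A minimal sketch of automating that comparison, assuming the `bt` output has the shape shown in this thread; `fault_point` is a hypothetical helper, not part of gdb or Zephyr:

```python
import re
from pathlib import PurePosixPath

# Matches "#0 func (" and "#1 0x0003d608 in func (", but not
# "#5 <signal handler called>".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?([A-Za-z_]\w*) \(")
# Matches the trailing "at /path/to/file.c:894" of a frame.
LOCATION_RE = re.compile(r"\bat (\S+):(\d+)")


def fault_point(backtrace: str):
    """Return (function, source file name, line) for frame #0, or None."""
    function = None
    for raw in backtrace.splitlines():
        line = raw.strip()
        frame = FRAME_RE.match(line)
        if frame:
            if function is not None:
                break  # the next frame started before a location was seen
            if frame.group(1) == "0":
                function = frame.group(2)
        if function is not None:
            location = LOCATION_RE.search(line)
            if location:
                # Keep only the file name so different checkout paths compare equal.
                name = PurePosixPath(location.group(1)).name
                return function, name, int(location.group(2))
    return None
```

Two dumps whose `fault_point` results are equal faulted at the same function and source line; this says nothing about whether the rest of the call stack matches.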
|
For everyone who has hit the issue, can you please give my |
|
I tried the It did generate core dumps though, and here is a backtrace: Attached uf2 for use with |
|
@SethMilliken Can you try again? I've updated it to use a Zephyr branch with a small set of selective commits suggested by @cvinayak |
|
Firmware with selective cherry picks now installed on my two main boards. I'll be doing a lot of typing tonight. Fingers crossed (but only for activating combos). |
|
Two lockups already with the new selective cherry picks firmware, both with the same backtrace: Which looks like the same callsite as before: LMK if you want the UF2. |
In addition to the 3 commits, please include a quick check commit patch attached. |
Creating this as a draft for now. I will be considering whether these new board variants should be merged at a later date, but the immediate goal is to try to allow postmortem analysis of faults with ZMK, like the one in #3195 and the reported central issues in ZMK since the upgrade of Zephyr. Anyone who's been hitting that, I implore you to try these steps to help us track down that lingering issue.
Testing Steps
Edit: Added entry to change GHA to include the .elf file in the build zip
.github/workflows/build.yml to point to petejohanson/zmk/.github/workflows/build-user-config.yml@core-dump-experimentation
zmk_debug variant
CURRENT.UF2 from the mass storage device from your host.
zmk.elf file from your build you used.
Debugging Steps
python ../zephyr/scripts/build/uf2conv.py ~/Downloads/xiao-tester-crash.uf2 --output ~/Documents/xiao-tester-crash.bin
python ../zephyr/scripts/coredump/coredump_flash_dump_extractor.py ~/Documents/xiao-tester-crash.bin 0x70000 ~/Documents/xiao_tester_dump.bin (The 0x70000 offset matches the coredump partition offset, but with the bootloader's 0x1000 offset applied.)
python ../zephyr/scripts/coredump/coredump_gdbserver.py build/xiao_tester_debug/zephyr/zmk.elf ~/Documents/xiao_tester_dump.bin
gdb zmk.elf then in the session, run target remote localhost:1234
The later steps are a reproduction of the docs from https://docs.zephyrproject.org/4.1.0/services/debugging/coredump.html
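Since the debugging steps above are the same for every dump apart from the file names and the offset, they can be generated rather than retyped. The sketch below only builds the command lines and does not run anything; `coredump_commands` is a hypothetical helper, the derived output file names are an assumption for illustration, and the script paths and arguments are copied from the steps as written in this PR.

```python
import shlex
from pathlib import PurePosixPath


def coredump_commands(
    uf2: PurePosixPath,
    elf: PurePosixPath,
    offset: int,
    scripts: PurePosixPath = PurePosixPath("../zephyr/scripts"),
    python: str = "python",
    gdb: str = "gdb",
) -> list[list[str]]:
    """Build the four command lines from the debugging steps, in order."""
    # ASSUMPTION: intermediate files are written next to the input UF2.
    flash_bin = uf2.with_suffix(".bin")
    dump_bin = uf2.with_name(uf2.stem + "_dump.bin")
    return [
        # 1. Convert the UF2 copied from the board into a raw flash image.
        [python, str(scripts / "build" / "uf2conv.py"), str(uf2),
         "--output", str(flash_bin)],
        # 2. Extract the coredump from the flash image at the given offset.
        [python, str(scripts / "coredump" / "coredump_flash_dump_extractor.py"),
         str(flash_bin), hex(offset), str(dump_bin)],
        # 3. Serve the dump over the gdb remote protocol.
        [python, str(scripts / "coredump" / "coredump_gdbserver.py"),
         str(elf), str(dump_bin)],
        # 4. Attach gdb to the server started in step 3.
        [gdb, str(elf), "-ex", "target remote localhost:1234"],
    ]


if __name__ == "__main__":
    for command in coredump_commands(
        PurePosixPath("xiao-tester-crash.uf2"), PurePosixPath("zmk.elf"), 0x70000
    ):
        print(shlex.join(command))
```

The gdbserver command in step 3 has to keep running while gdb is attached, so it needs its own terminal or a trailing `&`. For ARM targets, pass the SDK's gdb through the `gdb` parameter instead of the host one.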
coredump_flash_dump_extractor.py
PR check-list