Debugging a crashed kernel on openSUSE

 2024-04-08

Recently my work laptop has been having kernel panics that force me to reset the system. It used to be extremely stable, but I have been dealing with multiple kernel panics in the past few weeks. The first step is to confirm that you really are having a kernel panic. If you have a desktop environment, it may look as if your desktop has simply hung. On Linux, when the kernel panics, the Caps Lock and Scroll Lock LEDs blink rapidly, i.e. the blinking keys of death.

Kdump or Pstore

Once you know you are having kernel panics, you need to enable either kdump or pstore. Kdump is the main way to recover debugging information when the kernel has an issue. However, if the panic happens before the system is able to load disk drivers, kdump cannot save its data; in that case pstore is a good option. In most cases you will need kdump. To enable it I would recommend using YaST: you can either search for "yast kdump" in your application launcher (e.g. Kickoff on KDE Plasma) or run sudo yast2 kdump, which will launch the TUI version of YaST. Here you can enable kdump and configure it to your liking. By default kdump saves files to /var/crash, and going forward I will assume that you are saving to this location.
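
Once kdump is enabled and the machine has been rebooted, it is worth confirming that a crash kernel is actually loaded before waiting for the next panic. Here is a minimal sketch in Python, assuming the standard Linux interfaces /proc/cmdline and /sys/kernel/kexec_crash_loaded. The helper names are hypothetical and not part of any kdump tooling.

```python
from pathlib import Path


def has_crashkernel(cmdline: str) -> bool:
    """Return True if the kernel command line carries a crashkernel= option."""
    return any(arg.startswith("crashkernel=") for arg in cmdline.split())


def crash_kernel_loaded(flag: str) -> bool:
    """Interpret the contents of /sys/kernel/kexec_crash_loaded ("1" means loaded)."""
    return flag.strip() == "1"


def read_or_empty(path: str) -> str:
    """Read a small text file, returning "" if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    cmdline = read_or_empty("/proc/cmdline")
    flag = read_or_empty("/sys/kernel/kexec_crash_loaded")
    print("crashkernel= on cmdline:", has_crashkernel(cmdline))
    print("crash kernel loaded:", crash_kernel_loaded(flag))
```

If either line prints False, kdump will most likely not be able to capture a dump, and it is worth revisiting the YaST configuration before moving on.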

Triggering the kernel panic

At this point you need to reboot the system and wait for the kernel to panic; if you know how to trigger the panic, you can do so. If you do not know how to reproduce it, a lot of this is really just hoping the panic happens again. When the kernel does panic, kdump takes over. If you have an encrypted disk you may have to unlock it at this point. Kdump will then save everything and reboot.

crash, debuginfo and gdb

At this point we need to install a few packages. The first is crash, which lets you debug netdump, diskdump, LKCD and many more dump formats. Second, we need the debug information for the kernel. To get it we need to enable the openSUSE OSS debug repo, which you can do by running sudo zypper mr --enable openSUSE:repo-oss-debug. Now we can run sudo zypper in kernel-default-debuginfo, which will fetch the debuginfo (note that you should get the debuginfo for the kernel version that actually crashed).
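
Since the debuginfo has to match the kernel that crashed, a quick check at this stage can save some confusion. The sketch below assumes the file layout used by the crash command later in this post (vmlinux.debug under /usr/lib/debug and vmlinux.xz under /usr/lib/modules). The function names are hypothetical, and other kernel flavours or distributions may place these files elsewhere.

```python
import platform
from pathlib import Path


def kernel_files(release: str) -> dict[str, Path]:
    """Paths crash needs for a given kernel release, e.g. "6.8.1-1-default"."""
    return {
        "debuginfo": Path("/usr/lib/debug/usr/lib/modules") / release / "vmlinux.debug",
        "vmlinux": Path("/usr/lib/modules") / release / "vmlinux.xz",
    }


def missing_files(release: str) -> list[str]:
    """Names of the files from kernel_files() that are not present on disk."""
    return [name for name, path in kernel_files(release).items() if not path.exists()]


def crash_argv(release: str, vmcore: str) -> list[str]:
    """Build the crash command line for a kernel release and a vmcore path."""
    files = kernel_files(release)
    return ["crash", str(files["debuginfo"]), str(files["vmlinux"]), vmcore]


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    print("running kernel:", release)
    print("missing:", missing_files(release) or "nothing")
    print(" ".join(crash_argv(release, "./vmcore")))
```

If the machine has been updated since the crash, the running release will no longer be the right one; in that case pass the kernel version recorded with the dump instead of platform.release().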

Now we can start to debug the kernel crash. The directory structure of /var/crash will look something like this:

/var/crash
└── 2024-03-23-16-30
   ├──dmesg
   ├──README.txt
   └──vmcore

We can start by looking at README.txt. It contains information such as the status of the various dumps (e.g. dmesg), the dump time, the hostname and other details. It looks something like this:

Kernel crashdump
----------------
dmesg status: saved successfully
vmcore status: saved successfully
Kernel version: 6.8.1-1-default
Crash time: 2024-03-23T10:59:41
Dump time: 2024-03-23T16:30:04
Host: aerial
Dump level: 31
Dump format: compressed
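
If you collect several dumps, checking each README.txt by hand gets tedious. Here is a minimal sketch that parses the "Key: value" lines shown above into a dictionary. The format is inferred from this one sample, so treat the parser as an assumption rather than a documented interface; the function names are hypothetical.

```python
def parse_kdump_readme(text: str) -> dict[str, str]:
    """Parse the "Key: value" lines of a kdump README.txt into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skips the title, the dashed rule and blank lines
            info[key.strip()] = value.strip()
    return info


def dump_is_complete(info: dict[str, str]) -> bool:
    """True when both the dmesg and the vmcore were saved successfully."""
    wanted = ("dmesg status", "vmcore status")
    return all(info.get(key) == "saved successfully" for key in wanted)


# The README.txt from this post, used as sample input.
SAMPLE = """\
Kernel crashdump
----------------
dmesg status: saved successfully
vmcore status: saved successfully
Kernel version: 6.8.1-1-default
Crash time: 2024-03-23T10:59:41
Dump time: 2024-03-23T16:30:04
Host: aerial
Dump level: 31
Dump format: compressed
"""

if __name__ == "__main__":
    info = parse_kdump_readme(SAMPLE)
    print(info["Kernel version"], dump_is_complete(info))
```

The "Kernel version" field is the one to match against the installed debuginfo package.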

The README shows us that the dmesg and the vmcore have been saved. Now we can look at the dmesg to get some hints about what happened. Some portions have been removed because they were just normal system messages.

[...]
[10657.388836] BUG: unable to handle page fault for address: 00000000000035a3
[10657.388853] #PF: supervisor read access in kernel mode
[10657.388859] #PF: error_code(0x0000) - not-present page
[10657.388864] PGD 0 P4D 0 
[10657.388870] Oops: 0000 [#1] PREEMPT SMP NOPTI
[10657.388876] CPU: 3 PID: 4288 Comm: kworker/3:2 Kdump: loaded Tainted: P           OE      6.8.1-1-default #1 openSUSE Tumbleweed a408dede100ecd8172a7eae2d0778227ac69e46d
[10657.388885] Hardware name: LENOVO 21AAS0R100/21AAS0R100, BIOS N38ET43W (1.24 ) 11/14/2023
[10657.388890] Workqueue: cgroup_destroy css_free_rwork_fn
[10657.388901] RIP: 0010:rb_first+0xf/0x30
[10657.388910] Code: 10 c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 8b 07 48 85 c0 74 14 48 89 c2 <48> 8b 40 10 48 85 c0 75 f4 48 89 d0 c3 cc cc cc cc 31 d2 eb f4 66
[10657.388917] RSP: 0000:ffffabe5a583fde0 EFLAGS: 00010206
[10657.388923] RAX: 0000000000003593 RBX: ffff8e18842b7780 RCX: ffffcd4cdc678a90
[10657.388928] RDX: 0000000000003593 RSI: 0000000000000000 RDI: ffff8e1886681498
[10657.388932] RBP: ffff8e22b4bd9d80 R08: 0000000000000246 R09: 00000000010000ff
[10657.388936] R10: 00000000010000ff R11: fefefefefefefeff R12: 0000000000000000
[10657.388941] R13: ffff8e1886681498 R14: ffff8e19216b3000 R15: ffff8e18817e8ab0
[10657.388945] FS:  0000000000000000(0000) GS:ffff8e23af180000(0000) knlGS:0000000000000000
[10657.388950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10657.388955] CR2: 00000000000035a3 CR3: 00000002ee03a002 CR4: 0000000000f70ef0
[10657.388960] PKRU: 55555554
[10657.388963] Call Trace:
[10657.388968]  <TASK>
[10657.388974]  ? __die+0x23/0x70
[10657.388982]  ? page_fault_oops+0x14d/0x490
[10657.388991]  ? exc_page_fault+0x71/0x160
[10657.388997]  ? asm_exc_page_fault+0x26/0x30
[10657.389007]  ? rb_first+0xf/0x30
[10657.389012]  simple_xattrs_free+0x29/0x90
[10657.389020]  kernfs_put.part.0+0x60/0x150
[10657.389029]  css_free_rwork_fn+0x125/0x3c0
[10657.389035]  process_one_work+0x165/0x330
[10657.389042]  worker_thread+0x2f5/0x410
[10657.389048]  ? __pfx_worker_thread+0x10/0x10
[10657.389053]  kthread+0xe5/0x120
[10657.389060]  ? __pfx_kthread+0x10/0x10
[10657.389066]  ret_from_fork+0x31/0x50
[10657.389073]  ? __pfx_kthread+0x10/0x10
[10657.389078]  ret_from_fork_asm+0x1b/0x30
[10657.389086]  </TASK>
[...]
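
A few values in this oops can be decoded by hand. The sketch below, with hypothetical helper names, decodes the low bits of the x86 page fault error code and checks how far the faulting address in CR2 is from RAX. The marked bytes in the Code: line (48 8b 40 10) decode to a load from 0x10(%rax), so the difference should be 0x10. On x86-64 that offset should correspond to the rb_left member of struct rb_node, which is the pointer rb_first follows. That last mapping is an assumption based on the upstream struct layout, not something the dump states.

```python
# Low bits of the x86 page fault error code: (mask, meaning if clear, meaning if set).
PF_BITS = [
    (0x01, "not-present page", "protection violation"),
    (0x02, "read access", "write access"),
    (0x04, "kernel mode", "user mode"),
    (0x08, None, "reserved bit set in page table"),
    (0x10, None, "instruction fetch"),
]


def decode_pf_error(code: int) -> list[str]:
    """Describe the low bits of a page fault error code."""
    out = []
    for mask, when_clear, when_set in PF_BITS:
        text = when_set if code & mask else when_clear
        if text:
            out.append(text)
    return out


def faulting_offset(cr2: int, base: int) -> int:
    """Distance between the faulting address and the base register."""
    return cr2 - base


if __name__ == "__main__":
    # Values copied from the oops above: error_code, CR2 and RAX.
    print(decode_pf_error(0x0000))
    print(hex(faulting_offset(cr2=0x35A3, base=0x3593)))
```

The decoded bits agree with the two #PF lines in the log, and the offset comes out as 0x10. The value in RAX (0x3593) is far too small to be a valid kernel pointer, which suggests that the tree root or one of its nodes held garbage by the time rb_first walked it.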

In the dmesg output we can see there was a page fault and an attempt at a backtrace. Now it's time for crash. crash is a super powerful tool that lets us debug the kernel crash. To start off we need to give crash the debuginfo, the vmlinux file and the vmcore file. This is as simple as running crash /usr/lib/debug/usr/lib/modules/6.8.1-1-default/vmlinux.debug /usr/lib/modules/6.8.1-1-default/vmlinux.xz ./vmcore. This gets us a gdb-based session we can use to debug the kernel panic. To start with we can run sys to get an idea about the system. Then we can simply run bt, which prints the backtrace of the task that panicked.

PID: 4288     TASK: ffff8e18c865af40  CPU: 3    COMMAND: "kworker/3:2"
#0 [ffffabe5a583fb48] machine_kexec at ffffffffa108fe0f
#1 [ffffabe5a583fba0] __crash_kexec at ffffffffa11c7e0e
#2 [ffffabe5a583fc60] crash_kexec at ffffffffa11c92a4
#3 [ffffabe5a583fc68] oops_end at ffffffffa10457c4
#4 [ffffabe5a583fc88] page_fault_oops at ffffffffa10a53f1
#5 [ffffabe5a583fd08] exc_page_fault at ffffffffa1ccfc21
#6 [ffffabe5a583fd30] asm_exc_page_fault at ffffffffa1e012a6
   [exception RIP: rb_first+15]
   RIP: ffffffffa1cb7a8f  RSP: ffffabe5a583fde0  RFLAGS: 00010206
   RAX: 0000000000003593  RBX: ffff8e18842b7780  RCX: ffffcd4cdc678a90
   RDX: 0000000000003593  RSI: 0000000000000000  RDI: ffff8e1886681498
   RBP: ffff8e22b4bd9d80   R8: 0000000000000246   R9: 00000000010000ff
   R10: 00000000010000ff  R11: fefefefefefefeff  R12: 0000000000000000
   R13: ffff8e1886681498  R14: ffff8e19216b3000  R15: ffff8e18817e8ab0
   ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
#7 [ffffabe5a583fde0] simple_xattrs_free at ffffffffa1453ba9
#8 [ffffabe5a583fe08] kernfs_put at ffffffffa14dd160
#9 [ffffabe5a583fe30] css_free_rwork_fn at ffffffffa11d59d5
#10 [ffffabe5a583fe68] process_one_work at ffffffffa10e8635
#11 [ffffabe5a583fea8] worker_thread at ffffffffa10e97e5
#12 [ffffabe5a583fef8] kthread at ffffffffa10f38c5
#13 [ffffabe5a583ff30] ret_from_fork at ffffffffa1050f41
#14 [ffffabe5a583ff50] ret_from_fork_asm at ffffffffa100383b
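
When reading a bt like this, the frames above the exception are the panic and kdump machinery, and the frames after it show what the task was actually doing. Here is a small sketch, assuming bt output is formatted as in the listing above, that pulls out the faulting symbol and the functions listed after the exception so they can be looked up in the kernel source. The function and pattern names are hypothetical.

```python
import re
from typing import Optional

# Matches frame lines such as
# "#7 [ffffabe5a583fde0] simple_xattrs_free at ffffffffa1453ba9".
FRAME_RE = re.compile(r"^\s*#(\d+)\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+([0-9a-f]+)")
# Matches the marker line "[exception RIP: rb_first+15]".
EXCEPTION_RE = re.compile(r"\[exception RIP: ([\w.]+)\+(\d+)\]")


def callers_after_exception(bt_text: str) -> tuple[Optional[str], list[str]]:
    """Return (faulting symbol, functions listed after the exception marker)."""
    faulting = None
    callers = []
    for line in bt_text.splitlines():
        exc = EXCEPTION_RE.search(line)
        if exc:
            faulting = exc.group(1)
            continue
        frame = FRAME_RE.match(line)
        if frame and faulting is not None:
            callers.append(frame.group(2))
    return faulting, callers


# A few lines from the backtrace in this post, used as sample input.
SAMPLE = """\
#5 [ffffabe5a583fd08] exc_page_fault at ffffffffa1ccfc21
#6 [ffffabe5a583fd30] asm_exc_page_fault at ffffffffa1e012a6
   [exception RIP: rb_first+15]
   RIP: ffffffffa1cb7a8f  RSP: ffffabe5a583fde0  RFLAGS: 00010206
#7 [ffffabe5a583fde0] simple_xattrs_free at ffffffffa1453ba9
#8 [ffffabe5a583fe08] kernfs_put at ffffffffa14dd160
#9 [ffffabe5a583fe30] css_free_rwork_fn at ffffffffa11d59d5
"""

if __name__ == "__main__":
    faulting, callers = callers_after_exception(SAMPLE)
    print("faulted in:", faulting)
    print("called from:", " <- ".join(callers))
```

For this dump the first caller after the exception is simple_xattrs_free, which makes it a reasonable symbol to search for first.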

At this point we have a backtrace and the PID of a process. In some cases you can run ps | grep 4288 in the crash session, which will give us a little more information about the crash. In this case it occurred inside a kernel worker, but in other cases it could have been caused by another process (e.g. php). From here we can look up the symbols in the backtrace. In this case we can use simple_xattrs_free, which very quickly leads us to the fs/xattr.c file in the Linux kernel source code and points us towards a filesystem bug. In my case it turned out someone had already reported the bug. I hope someone finds this useful in the future for identifying and debugging similar problems.