We hit an interesting case that looked very similar to part of the issue that motivated using a single periodic in, I believe, f514524. This manifested when @leftwo injected an NMI into the system. I started by looking at his hung login process:
> ::pgrep login
S PID PPID PGID SID UID FLAGS ADDR NAME
R 549 9 549 549 0 0x4a004000 fffffcfa2f64c008 login
> ::pgrep login | ::walk thread
0xfffffcfa37f94080
> ::pgrep login | ::walk thread | ::findstack -v
stack pointer for thread fffffcfa37f94080 (login/1): fffff788103c1ca0
[ fffff788103c1ca0 _resume_from_idle+0x12b() ]
fffff788103c1cd0 swtch+0x139()
fffff788103c1d70 turnstile_block+0x279(0, 0, fffffcfac0d6b138, fffffffffc0185c0, 0, 0)
fffff788103c1de0 mutex_vector_enter+0x378(fffffcfac0d6b138)
fffff788103c1e80 tasksys_settaskid+0x2b2(1, 0)
fffff788103c1ec0 tasksys+0x82(0, 1, 0, 0, 0)
fffff788103c1f10 sys_syscall32+0x105()
> fffffcfac0d6b138::mutex
ADDR TYPE HELD MINSPL OLDSPL WAITERS
fffffcfac0d6b138 adapt fffffcfada52cbc0 - - yes
> fffffcfada52cbc0::findstack -v
stack pointer for thread fffffcfada52cbc0 (cron/1): fffff7882b3dfca0
[ fffff7882b3dfca0 _resume_from_idle+0x12b() ]
fffff7882b3dfcd0 swtch+0x139()
fffff7882b3dfd70 turnstile_block+0x279(fffffcfc0eba6258, 0, fffffffffc062b40, fffffffffc0185c0, 0, 0)
fffff7882b3dfde0 mutex_vector_enter+0x378(fffffffffc062b40)
fffff7882b3dfe80 tasksys_settaskid+0x2e0(1, 0)
fffff7882b3dfec0 tasksys+0x82(0, 1, 0, 0, 0)
fffff7882b3dff10 sys_syscall32+0x105()
> fffffffffc062b40::mutex
ADDR TYPE HELD MINSPL OLDSPL WAITERS
fffffffffc062b40 adapt fffffcfc1fcde7e0 - - yes
> fffffcfc1fcde7e0::findstack -v
stack pointer for thread fffffcfc1fcde7e0 (propolis-server/25 [vcpu-1]): fffff7881d6994b0
fffff7881d699520 apix_hilevel_intr_prolog+0x82(fffffcfa26d36000, f, 0, fffff7881d699590)
fffff7881d699580 apix_do_interrupt+0x3d4(fffff7881d699590, 401f2ff)
fffff7881d699590 _interrupt+0xc3()
fffff7881d6996b0 gethrtime()
fffff7881d699740 cyclic_juggle_one_to+0xbc(fffffcfafad6d860, fffffcfa2ee6c280)
fffff7881d699780 cyclic_move_here+0x88(fffffcfafad6d860)
fffff7881d6997b0 vmm_glue_callout_localize+0x27(fffffcfac6a8fee8)
fffff7881d6997d0 vlapic_localize_resources+0x15(fffffcfac6a8fea0)
fffff7881d699800 vm_localize_resources+0x46(fffffcfbf3c19000, fffffcfbf3c194a0)
fffff7881d699880 vm_run+0x114(fffffcfbf3c19000, 1, fffff7881d6999c0)
fffff7881d699c40 vmmdev_do_ioctl+0x8fa(fffffd003712d780, 767001, fffff5ffe75ff990, 202003, fffffcfc2628e788, fffff7881d699e18)
fffff7881d699cc0 vmm_ioctl+0x12f(9900000008, 767001, fffff5ffe75ff990, 202003, fffffcfc2628e788, fffff7881d699e18)
fffff7881d699d00 cdev_ioctl+0x3f(9900000008, 767001, fffff5ffe75ff990, 202003, fffffcfc2628e788, fffff7881d699e18)
fffff7881d699d50 spec_ioctl+0x55(fffffcfa55577800, 767001, fffff5ffe75ff990, 202003, fffffcfc2628e788, fffff7881d699e18, 0)
fffff7881d699de0 fop_ioctl+0x40(fffffcfa55577800, 767001, fffff5ffe75ff990, 202003, fffffcfc2628e788, fffff7881d699e18, 0)
fffff7881d699f00 ioctl+0x144(10, 767001, fffff5ffe75ff990)
fffff7881d699f10 sys_syscall+0x17d()
Now, before we jump to cyclics, we have to remember that we're probably trying to take the CPU lock (and related locks) and are in a backoff case. This is very similar to https://github.com/oxidecomputer/stlouis/issues/718. So I moved on and instead opted to look at the OPTE stacks. Here were the first two that I noticed:
> fffff78811769c20::findstack -v
stack pointer for thread fffff78811769c20 (unix`thread_create_intr()): fffff788117692a0
[ fffff788117692a0 resume_from_intr+0xbc() ]
fffff788117692d0 swtch+0x85()
fffff78811769370 turnstile_block+0x279(fffffcfa28a7aea0, 1, ffffffffc00f7890, fffffffffc0194a0, 0, 0)
fffff788117693e0 rw_enter_sleep+0x330(ffffffffc00f7890, 1)
fffff78811769930 xde_rx+0x2d2()
fffff78811769980 mac_promisc_dispatch_one+0x60(fffffcfa7dd8eee8, fffffd0065edf8e0, 0, 0)
fffff78811769a00 mac_promisc_dispatch+0x83(fffffcfa3a09c010, fffffd0065edf8e0, 0, 0)
fffff78811769a60 mac_rx_common+0x47(fffffcfa3a09c010, fffffcfa3e069d88, fffffd0065edf8e0)
fffff78811769ac0 mac_rx+0xc6(fffffcfa3a09c010, fffffcfa3e069d88, fffffd0065edf8e0)
fffff78811769b00 mac_rx_ring+0x2b(fffffcfa3a09c010, fffffcfa3e069d88, fffffd0065edf8e0, 1)
fffff78811769b40 t4_intr_rx_work+0xb4(fffffcfa31388130)
fffff78811769b80 t4_intr+0x5d(fffffcfa3c9ed300, fffffcfa31388130)
fffff78811769bd0 apix_dispatch_by_vector+0x8c(20)
fffff78811769c00 apix_dispatch_lowlevel+0x29(20, 0)
fffff788117189f0 switch_sp_and_call+0x15()
fffff78811718a50 apix_do_interrupt+0x3d4(fffff78811718a60, 41)
fffff78811718a60 _interrupt+0xc3()
fffff78811718bb0 cpu_cstate_enter+0x195(fffffcfa249e4bc0)
fffff78811718bd0 cpu_cstate_idle+0x44()
fffff78811718be0 cpu_idle_adaptive+0x19()
fffff78811718c00 idle+0xa8()
fffff78811718c10 thread_start+0xb()
> ffffffffc00f7890::rwlock
ADDR OWNER/COUNT FLAGS WAITERS
ffffffffc00f7890 fffffcfa361bc0a0 B111 fffffcfa361ceb60 (W)
||| fffff78811769c20 (R)
WRITE_LOCKED ------+|| fffff78809c80c20 (R)
WRITE_WANTED -------+|
HAS_WAITERS --------+
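As an aside, the ::rwlock output above packs the owner pointer and the three flag bits ("B111") into one word. A hypothetical decoder for that word is sketched below; the flag values are an assumption based on illumos's sys/rwlock_impl.h and the labels mdb prints, not something taken from this dump.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Assumed flag values for the low bits of the rwlock word, matching the
 * WRITE_LOCKED/WRITE_WANTED/HAS_WAITERS labels mdb prints. When the lock
 * is write-held, the remaining bits are the owning kthread_t pointer.
 */
#define RW_HAS_WAITERS   0x1ULL
#define RW_WRITE_WANTED  0x2ULL
#define RW_WRITE_LOCKED  0x4ULL
#define RW_FLAG_MASK     0x7ULL

/* Return the owner kthread_t pointer, or 0 if not write-locked. */
static uint64_t
rw_owner(uint64_t wwwh)
{
	return ((wwwh & RW_WRITE_LOCKED) ? (wwwh & ~RW_FLAG_MASK) : 0);
}

/* Print the word the way ::rwlock's FLAGS column describes it. */
static void
rw_print(uint64_t wwwh)
{
	printf("owner=%llx%s%s%s\n",
	    (unsigned long long)rw_owner(wwwh),
	    (wwwh & RW_WRITE_LOCKED) ? " WRITE_LOCKED" : "",
	    (wwwh & RW_WRITE_WANTED) ? " WRITE_WANTED" : "",
	    (wwwh & RW_HAS_WAITERS) ? " HAS_WAITERS" : "");
}
```

For the lock above, owner fffffcfa361bc0a0 with flags B111 decodes to a write-locked lock with a writer wanted and waiters queued, which matches the (W) and (R) waiters mdb lists.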
> fffffcfa361bc0a0::findstack -v
stack pointer for thread fffffcfa361bc0a0 (sled-agent/126 [tokio-runtime-worker]): fffff788104338a0
[ fffff788104338a0 _resume_from_idle+0x12b() ]
fffff788104338d0 swtch+0x139()
fffff78810433940 stop+0x445(7, 0)
fffff788104339e0 issig_forreal+0x227()
fffff78810433a00 issig+0x15(0)
fffff78810433aa0 door_upcall+0x3df(fffffcfa31468d80, fffff78810433b48, fffffcf93257edb0, ffffffffffffffff, 0)
fffff78810433b20 door_ki_upcall_limited+0x74(fffffcfa2f4fedc8, fffff78810433b48, fffffcf93257edb0, ffffffffffffffff, 0)
fffff78810433bd0 i_dls_mgmt_upcall+0xbd(fffff78810433be0, 8, fffff78810433be8, 30)
fffff78810433c80 dls_mgmt_get_linkinfo+0x79(21, fffff78810433cb8, fffff78810433cac, 0, 0)
fffff78810433d30 dls_devnet_set+0xf3(fffffd006bc8c000, 21, 0, fffff78810433d58)
fffff78810433da0 dls_devnet_create+0x44(fffffd006bc8c000, 21, 0)
fffff78810434040 create_xde+0x1abc()
fffff78810434c90 xde_ioc_opte_cmd+0x99a2()
fffff78810434cc0 xde_ioctl+0x7e()
fffff78810434d00 cdev_ioctl+0x3f(b200020000, de777701, fffff5ffcedfc0a0, 202003, fffffcfa3cacb8d0, fffff78810434e18)
fffff78810434d50 spec_ioctl+0x55(fffffcfaa882da80, de777701, fffff5ffcedfc0a0, 202003, fffffcfa3cacb8d0, fffff78810434e18, 0)
fffff78810434de0 fop_ioctl+0x40(fffffcfaa882da80, de777701, fffff5ffcedfc0a0, 202003, fffffcfa3cacb8d0, fffff78810434e18, 0)
fffff78810434f00 ioctl+0x144(49, de777701, fffff5ffcedfc0a0)
fffff78810434f10 sys_syscall+0x17d()
Looking at that thread, we could see it was part of sled-agent. Also notable: the upcall thread holding the lock was stopped. From there, I looked again at the list of all the stacks in OPTE and sled-agent:
> ::stacks -m xde
THREAD STATE SOBJ COUNT
fffff78811769c20 SLEEP RWLOCK 1
swtch+0x85
turnstile_block+0x279
rw_enter_sleep+0x330
xde_rx+0x2d2
mac_promisc_dispatch_one+0x60
mac_promisc_dispatch+0x83
mac_rx_common+0x47
mac_rx+0xc6
mac_rx_ring+0x2b
t4_intr_rx_work+0xb4
t4_intr+0x5d
apix_dispatch_by_vector+0x8c
apix_dispatch_lowlevel+0x29
fffffcfa361ceb60 SLEEP RWLOCK 1
swtch+0x139
turnstile_block+0x279
rw_enter_sleep+0x1de
create_xde+0x63
xde_ioc_opte_cmd+0x99a2
xde_ioctl+0x7e
cdev_ioctl+0x3f
spec_ioctl+0x55
fop_ioctl+0x40
ioctl+0x144
fffff78809c80c20 SLEEP RWLOCK 1
swtch+0x139
turnstile_block+0x279
rw_enter_sleep+0x330
shared_periodic_expire+0x25
periodic_execute+0xf5
taskq_thread+0x2a6
thread_start+0xb
fffffcfa361bc0a0 STOPPED <NONE> 1
swtch+0x139
stop+0x445
issig_forreal+0x227
issig+0x15
door_upcall+0x3df
door_ki_upcall_limited+0x74
i_dls_mgmt_upcall+0xbd
dls_mgmt_get_linkinfo+0x79
dls_devnet_set+0xf3
dls_devnet_create+0x44
create_xde+0x1abc
xde_ioc_opte_cmd+0x99a2
xde_ioctl+0x7e
cdev_ioctl+0x3f
spec_ioctl+0x55
fop_ioctl+0x40
ioctl+0x144
> ::pgrep sled-agent | ::walk thread | ::stacks
THREAD STATE SOBJ COUNT
fffffcfa361c2060 SLEEP CV 5
swtch+0x139
cv_wait_sig_swap_core+0x160
cv_wait_sig_swap+0x17
cv_waituntil_sig+0xcd
lwp_park+0x13f
syslwp_park+0x85
fffffcfa34ad3c40 STOPPED <NONE> 124
swtch+0x139
stop+0x445
issig_forreal+0x227
issig+0x15
cv_wait_sig_swap_core+0x277
cv_wait_sig_swap+0x17
cv_waituntil_sig+0xcd
lwp_park+0x13f
syslwp_park+0x85
fffffcfc174520c0 STOPPED <NONE> 2
swtch+0x139
stop+0x445
issig_forreal+0x227
issig+0x15
cv_timedwait_sig_hires+0x37f
cv_waituntil_sig+0x105
lwp_park+0x13f
syslwp_park+0x85
fffffcfa361bf080 SLEEP CV 1
swtch+0x139
cv_wait_sig+0x154
lwp_suspend+0x85
syslwp_suspend+0x44
fffffcfa361ceb60 SLEEP RWLOCK 1
swtch+0x139
turnstile_block+0x279
rw_enter_sleep+0x1de
create_xde+0x63
xde_ioc_opte_cmd+0x99a2
xde_ioctl+0x7e
cdev_ioctl+0x3f
spec_ioctl+0x55
fop_ioctl+0x40
ioctl+0x144
fffffcfa346d87e0 STOPPED <NONE> 1
swtch+0x139
stop+0x445
issig_forreal+0x227
issig+0x15
cv_wait_sig+0x28f
cte_get_event+0x93
ctfs_endpoint_ioctl+0xe1
ctfs_bu_ioctl+0x4e
fop_ioctl+0x40
ioctl+0x144
fffffcfa361bc0a0 STOPPED <NONE> 1
swtch+0x139
stop+0x445
issig_forreal+0x227
issig+0x15
door_upcall+0x3df
door_ki_upcall_limited+0x74
i_dls_mgmt_upcall+0xbd
dls_mgmt_get_linkinfo+0x79
dls_devnet_set+0xf3
dls_devnet_create+0x44
create_xde+0x1abc
xde_ioc_opte_cmd+0x99a2
xde_ioctl+0x7e
cdev_ioctl+0x3f
spec_ioctl+0x55
fop_ioctl+0x40
ioctl+0x144
So there were a few things I noticed:
- First, the thread fffff78809c80c20 is trying to execute periodic-related tasks. This was suspicious to me, but I'll admit I didn't fully link it back on first reading.
- With some additional logic, we can see that we can't remove the cyclic because of local pending entries.
- You'll see that all of sled-agent is paused or trying to pause. It took me a while to spot the thread calling lwp_suspend here. I also noticed that everyone's stop code is 7, 0 (7 being PR_SUSPENDED, if I'm reading it right), which basically means this was requested by one of several code paths.
Let's dig into this part of the deadlock a bit. First, the following thread owns the xde rwlock:
> fffffcfa361bc0a0::findstack -v
stack pointer for thread fffffcfa361bc0a0 (sled-agent/126 [tokio-runtime-worker]): fffff788104338a0
[ fffff788104338a0 _resume_from_idle+0x12b() ]
fffff788104338d0 swtch+0x139()
fffff78810433940 stop+0x445(7, 0)
fffff788104339e0 issig_forreal+0x227()
fffff78810433a00 issig+0x15(0)
fffff78810433aa0 door_upcall+0x3df(fffffcfa31468d80, fffff78810433b48, fffffcf93257edb0, ffffffffffffffff, 0)
fffff78810433b20 door_ki_upcall_limited+0x74(fffffcfa2f4fedc8, fffff78810433b48, fffffcf93257edb0, ffffffffffffffff, 0)
fffff78810433bd0 i_dls_mgmt_upcall+0xbd(fffff78810433be0, 8, fffff78810433be8, 30)
fffff78810433c80 dls_mgmt_get_linkinfo+0x79(21, fffff78810433cb8, fffff78810433cac, 0, 0)
fffff78810433d30 dls_devnet_set+0xf3(fffffd006bc8c000, 21, 0, fffff78810433d58)
fffff78810433da0 dls_devnet_create+0x44(fffffd006bc8c000, 21, 0)
fffff78810434040 create_xde+0x1abc()
fffff78810434c90 xde_ioc_opte_cmd+0x99a2()
fffff78810434cc0 xde_ioctl+0x7e()
fffff78810434d00 cdev_ioctl+0x3f(b200020000, de777701, fffff5ffcedfc0a0, 202003, fffffcfa3cacb8d0, fffff78810434e18)
fffff78810434d50 spec_ioctl+0x55(fffffcfaa882da80, de777701, fffff5ffcedfc0a0, 202003, fffffcfa3cacb8d0, fffff78810434e18, 0)
fffff78810434de0 fop_ioctl+0x40(fffffcfaa882da80, de777701, fffff5ffcedfc0a0, 202003, fffffcfa3cacb8d0, fffff78810434e18, 0)
fffff78810434f00 ioctl+0x144(49, de777701, fffff5ffcedfc0a0)
fffff78810434f10 sys_syscall+0x17d()
This thread is stopped; based on the evidence of the lwp_suspend, the stop was requested by userland, and practically speaking it is preparing for a fork. The guess about fork is that sled-agent already has a child zoneadm process it's trying to clean up, while several threads are trying to call xde_create.
So, if this fork theory is true, the thread that needs to make progress is the one calling lwp_suspend. Let's see who that is and who they're trying to target:
> fffffcfa361bf080::findstack -v
stack pointer for thread fffffcfa361bf080 (sled-agent/125 [tokio-runtime-worker]): fffff7881042ede0
[ fffff7881042ede0 _resume_from_idle+0x12b() ]
fffff7881042ee10 swtch+0x139()
fffff7881042ee80 cv_wait_sig+0x154(fffffcfa3529e0e6, fffffcf932634700)
fffff7881042eec0 lwp_suspend+0x85(fffffcfa361ceb60)
fffff7881042ef00 syslwp_suspend+0x44(77)
fffff7881042ef10 sys_syscall+0x17d()
> fffffcfa361ceb60::findstack -v
stack pointer for thread fffffcfa361ceb60 (sled-agent/119 [tokio-runtime-worker]): fffff7880b6f3c60
[ fffff7880b6f3c60 _resume_from_idle+0x12b() ]
fffff7880b6f3c90 swtch+0x139()
fffff7880b6f3d30 turnstile_block+0x279(0, 0, ffffffffc00f7890, fffffffffc0194a0, 0, 0)
fffff7880b6f3da0 rw_enter_sleep+0x1de(ffffffffc00f7890, 0)
fffff7880b6f4040 create_xde+0x63()
fffff7880b6f4c90 xde_ioc_opte_cmd+0x99a2()
fffff7880b6f4cc0 xde_ioctl+0x7e()
fffff7880b6f4d00 cdev_ioctl+0x3f(b200020000, de777701, fffff5ffd09fc0a0, 202003, fffffcfa3cacb8d0, fffff7880b6f4e18)
fffff7880b6f4d50 spec_ioctl+0x55(fffffcfaa882da80, de777701, fffff5ffd09fc0a0, 202003, fffffcfa3cacb8d0, fffff7880b6f4e18, 0)
fffff7880b6f4de0 fop_ioctl+0x40(fffffcfaa882da80, de777701, fffff5ffd09fc0a0, 202003, fffffcfa3cacb8d0, fffff7880b6f4e18, 0)
fffff7880b6f4f00 ioctl+0x144(46, de777701, fffff5ffd09fc0a0)
fffff7880b6f4f10 sys_syscall+0x17d()
OK, based on this we can see how the deadlock has formed. We cannot suspend a thread because it's stuck waiting on an rwlock. The owner of that rwlock, held as a write lock, is our door upcall thread. However, the upcall gives it the opportunity to be stopped, and it was. Because we have hit this deadlock, we cannot execute the periodic. I'll admit it's not 100% clear to me yet how that leads to the backup of cyclic juggling, but it roughly follows that if some cyclics are unable to execute, perhaps that all gets backed up.
I believe the fundamental issue here is that we should not be holding locks across upcalls. While I don't know all of the moving pieces here, logically speaking we probably need to adjust creation/destruction to use a combination of data protected by a mutex and a cv, where the mutex protects the active kthread_t. A kthread_t is granted ownership when it can put itself in there, and everyone else waiting waits on the cv. This likely doesn't solve the other, non-creation/destruction paths, so that's something for us to discuss.
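A minimal sketch of that ownership scheme, using userland pthreads to stand in for kmutex_t/kcondvar_t (all names here are hypothetical, not from the xde code): a thread claims ownership under the mutex, then drops the mutex before doing the slow work such as the door upcall. If the owner is stopped mid-upcall, it holds no lock; waiters are parked on the cv instead, so nothing like lwp_suspend ends up blocked behind a write lock.

```c
#include <pthread.h>
#include <stdbool.h>

/*
 * Hypothetical ownership structure: the mutex protects only xo_held and
 * xo_owner, never the slow path itself.
 */
typedef struct xde_ownership {
	pthread_mutex_t	xo_lock;	/* protects xo_held/xo_owner */
	pthread_cond_t	xo_cv;		/* waiters for ownership */
	pthread_t	xo_owner;	/* current owning thread */
	bool		xo_held;
} xde_ownership_t;

/* Claim ownership, sleeping on the cv until it is free. */
static void
xde_own_enter(xde_ownership_t *xo)
{
	pthread_mutex_lock(&xo->xo_lock);
	while (xo->xo_held)
		pthread_cond_wait(&xo->xo_cv, &xo->xo_lock);
	xo->xo_held = true;
	xo->xo_owner = pthread_self();
	pthread_mutex_unlock(&xo->xo_lock);
	/*
	 * The mutex is dropped here: the slow path (e.g. the door
	 * upcall in dls_devnet_create) runs without any lock held, so
	 * a stopped owner leaves no lock stranded.
	 */
}

/* Release ownership and wake all waiters. */
static void
xde_own_exit(xde_ownership_t *xo)
{
	pthread_mutex_lock(&xo->xo_lock);
	xo->xo_held = false;
	pthread_cond_broadcast(&xo->xo_cv);
	pthread_mutex_unlock(&xo->xo_lock);
}
```

The key property is that a waiter sits in cv_wait (stoppable, holding nothing) rather than in rw_enter_sleep, so the suspend-versus-rwlock cycle seen above cannot form on the create/destroy path.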