
xde deadlock due to upcalls and sled-agent forks #758


Description

@rmustacc

We hit an interesting case that looked very similar to part of the issue that motivated using a single periodic in, I believe, f514524. We got a dump to investigate after @leftwo injected an NMI into the system. I started by looking at his hung login process:

> ::pgrep login                       
S PID    PPID   PGID   SID    UID    FLAGS      ADDR             NAME
R 549    9      549    549    0      0x4a004000 fffffcfa2f64c008 login
> ::pgrep login | ::walk thread 
0xfffffcfa37f94080
> ::pgrep login | ::walk thread  | ::findstack -v
stack pointer for thread fffffcfa37f94080 (login/1): fffff788103c1ca0
[ fffff788103c1ca0 _resume_from_idle+0x12b() ]
  fffff788103c1cd0 swtch+0x139()
  fffff788103c1d70 turnstile_block+0x279(0, 0, fffffcfac0d6b138, fffffffffc0185c0, 0, 0)
  fffff788103c1de0 mutex_vector_enter+0x378(fffffcfac0d6b138)
  fffff788103c1e80 tasksys_settaskid+0x2b2(1, 0)
  fffff788103c1ec0 tasksys+0x82(0, 1, 0, 0, 0)
  fffff788103c1f10 sys_syscall32+0x105()
> fffffcfac0d6b138::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
fffffcfac0d6b138 adapt fffffcfada52cbc0      -      -     yes
> fffffcfada52cbc0::findstack -v
stack pointer for thread fffffcfada52cbc0 (cron/1): fffff7882b3dfca0
[ fffff7882b3dfca0 _resume_from_idle+0x12b() ]
  fffff7882b3dfcd0 swtch+0x139()
  fffff7882b3dfd70 turnstile_block+0x279(fffffcfc0eba6258, 0, fffffffffc062b40, fffffffffc0185c0, 0, 0)
  fffff7882b3dfde0 mutex_vector_enter+0x378(fffffffffc062b40)
  fffff7882b3dfe80 tasksys_settaskid+0x2e0(1, 0)
  fffff7882b3dfec0 tasksys+0x82(0, 1, 0, 0, 0)
  fffff7882b3dff10 sys_syscall32+0x105()
> fffffffffc062b40::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
fffffffffc062b40 adapt fffffcfc1fcde7e0      -      -     yes
> fffffcfc1fcde7e0::findstack -v
stack pointer for thread fffffcfc1fcde7e0 (propolis-server/25 [vcpu-1]): fffff7881d6994b0
  fffff7881d699520 apix_hilevel_intr_prolog+0x82(fffffcfa26d36000, f, 0, fffff7881d699590)
  fffff7881d699580 apix_do_interrupt+0x3d4(fffff7881d699590, 401f2ff)
  fffff7881d699590 _interrupt+0xc3()
  fffff7881d6996b0 gethrtime()
  fffff7881d699740 cyclic_juggle_one_to+0xbc(fffffcfafad6d860, fffffcfa2ee6c280)
  fffff7881d699780 cyclic_move_here+0x88(fffffcfafad6d860)
  fffff7881d6997b0 vmm_glue_callout_localize+0x27(fffffcfac6a8fee8)
  fffff7881d6997d0 vlapic_localize_resources+0x15(fffffcfac6a8fea0)
  fffff7881d699800 vm_localize_resources+0x46(fffffcfbf3c19000, fffffcfbf3c194a0)
  fffff7881d699880 vm_run+0x114(fffffcfbf3c19000, 1, fffff7881d6999c0)
  fffff7881d699c40 vmmdev_do_ioctl+0x8fa(fffffd003712d780, 767001, fffff5ffe75ff990, 202003, fffffcfc2628e788, fffff7881d699e18)
  fffff7881d699cc0 vmm_ioctl+0x12f(9900000008, 767001, fffff5ffe75ff990, 202003, fffffcfc2628e788, fffff7881d699e18)
  fffff7881d699d00 cdev_ioctl+0x3f(9900000008, 767001, fffff5ffe75ff990, 202003, fffffcfc2628e788, fffff7881d699e18)
  fffff7881d699d50 spec_ioctl+0x55(fffffcfa55577800, 767001, fffff5ffe75ff990, 202003, fffffcfc2628e788, fffff7881d699e18, 0)
  fffff7881d699de0 fop_ioctl+0x40(fffffcfa55577800, 767001, fffff5ffe75ff990, 202003, fffffcfc2628e788, fffff7881d699e18, 0)
  fffff7881d699f00 ioctl+0x144(10, 767001, fffff5ffe75ff990)
  fffff7881d699f10 sys_syscall+0x17d()

Now, before we jump to cyclics, we have to remember that this thread is probably trying to take cpu_lock (and related locks) and is in a backoff case. This is very similar to https://github.com/oxidecomputer/stlouis/issues/718. So I moved on and instead opted to look at the OPTE stacks. Here are the first two that I noticed:

> fffff78811769c20::findstack -v
stack pointer for thread fffff78811769c20 (unix`thread_create_intr()): fffff788117692a0
[ fffff788117692a0 resume_from_intr+0xbc() ]
  fffff788117692d0 swtch+0x85()
  fffff78811769370 turnstile_block+0x279(fffffcfa28a7aea0, 1, ffffffffc00f7890, fffffffffc0194a0, 0, 0)
  fffff788117693e0 rw_enter_sleep+0x330(ffffffffc00f7890, 1)
  fffff78811769930 xde_rx+0x2d2()
  fffff78811769980 mac_promisc_dispatch_one+0x60(fffffcfa7dd8eee8, fffffd0065edf8e0, 0, 0)
  fffff78811769a00 mac_promisc_dispatch+0x83(fffffcfa3a09c010, fffffd0065edf8e0, 0, 0)
  fffff78811769a60 mac_rx_common+0x47(fffffcfa3a09c010, fffffcfa3e069d88, fffffd0065edf8e0)
  fffff78811769ac0 mac_rx+0xc6(fffffcfa3a09c010, fffffcfa3e069d88, fffffd0065edf8e0)
  fffff78811769b00 mac_rx_ring+0x2b(fffffcfa3a09c010, fffffcfa3e069d88, fffffd0065edf8e0, 1)
  fffff78811769b40 t4_intr_rx_work+0xb4(fffffcfa31388130)
  fffff78811769b80 t4_intr+0x5d(fffffcfa3c9ed300, fffffcfa31388130)
  fffff78811769bd0 apix_dispatch_by_vector+0x8c(20)
  fffff78811769c00 apix_dispatch_lowlevel+0x29(20, 0)
  fffff788117189f0 switch_sp_and_call+0x15()
  fffff78811718a50 apix_do_interrupt+0x3d4(fffff78811718a60, 41)
  fffff78811718a60 _interrupt+0xc3()
  fffff78811718bb0 cpu_cstate_enter+0x195(fffffcfa249e4bc0)
  fffff78811718bd0 cpu_cstate_idle+0x44()
  fffff78811718be0 cpu_idle_adaptive+0x19()
  fffff78811718c00 idle+0xa8()
  fffff78811718c10 thread_start+0xb()
> ffffffffc00f7890::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffffffc00f7890 fffffcfa361bc0a0  B111 fffffcfa361ceb60 (W)
                                    ||| fffff78811769c20 (R)
                 WRITE_LOCKED ------+|| fffff78809c80c20 (R)
                 WRITE_WANTED -------+|
                  HAS_WAITERS --------+
> fffffcfa361bc0a0::findstack -v
stack pointer for thread fffffcfa361bc0a0 (sled-agent/126 [tokio-runtime-worker]): fffff788104338a0
[ fffff788104338a0 _resume_from_idle+0x12b() ]
  fffff788104338d0 swtch+0x139()
  fffff78810433940 stop+0x445(7, 0)
  fffff788104339e0 issig_forreal+0x227()
  fffff78810433a00 issig+0x15(0)
  fffff78810433aa0 door_upcall+0x3df(fffffcfa31468d80, fffff78810433b48, fffffcf93257edb0, ffffffffffffffff, 0)
  fffff78810433b20 door_ki_upcall_limited+0x74(fffffcfa2f4fedc8, fffff78810433b48, fffffcf93257edb0, ffffffffffffffff, 0)
  fffff78810433bd0 i_dls_mgmt_upcall+0xbd(fffff78810433be0, 8, fffff78810433be8, 30)
  fffff78810433c80 dls_mgmt_get_linkinfo+0x79(21, fffff78810433cb8, fffff78810433cac, 0, 0)
  fffff78810433d30 dls_devnet_set+0xf3(fffffd006bc8c000, 21, 0, fffff78810433d58)
  fffff78810433da0 dls_devnet_create+0x44(fffffd006bc8c000, 21, 0)
  fffff78810434040 create_xde+0x1abc()
  fffff78810434c90 xde_ioc_opte_cmd+0x99a2()
  fffff78810434cc0 xde_ioctl+0x7e()
  fffff78810434d00 cdev_ioctl+0x3f(b200020000, de777701, fffff5ffcedfc0a0, 202003, fffffcfa3cacb8d0, fffff78810434e18)
  fffff78810434d50 spec_ioctl+0x55(fffffcfaa882da80, de777701, fffff5ffcedfc0a0, 202003, fffffcfa3cacb8d0, fffff78810434e18, 0)
  fffff78810434de0 fop_ioctl+0x40(fffffcfaa882da80, de777701, fffff5ffcedfc0a0, 202003, fffffcfa3cacb8d0, fffff78810434e18, 0)
  fffff78810434f00 ioctl+0x144(49, de777701, fffff5ffcedfc0a0)
  fffff78810434f10 sys_syscall+0x17d()

Looking at that thread, we could see it was part of sled-agent. Also notable: the upcall thread that was holding the lock was stopped. From there, I looked again at the list of all the stacks in OPTE and sled-agent:

> ::stacks -m xde 
THREAD           STATE    SOBJ                COUNT
fffff78811769c20 SLEEP    RWLOCK                  1
                 swtch+0x85
                 turnstile_block+0x279
                 rw_enter_sleep+0x330
                 xde_rx+0x2d2
                 mac_promisc_dispatch_one+0x60
                 mac_promisc_dispatch+0x83
                 mac_rx_common+0x47
                 mac_rx+0xc6
                 mac_rx_ring+0x2b
                 t4_intr_rx_work+0xb4
                 t4_intr+0x5d
                 apix_dispatch_by_vector+0x8c
                 apix_dispatch_lowlevel+0x29

fffffcfa361ceb60 SLEEP    RWLOCK                  1
                 swtch+0x139
                 turnstile_block+0x279
                 rw_enter_sleep+0x1de
                 create_xde+0x63
                 xde_ioc_opte_cmd+0x99a2
                 xde_ioctl+0x7e
                 cdev_ioctl+0x3f
                 spec_ioctl+0x55
                 fop_ioctl+0x40
                 ioctl+0x144

fffff78809c80c20 SLEEP    RWLOCK                  1
                 swtch+0x139
                 turnstile_block+0x279
                 rw_enter_sleep+0x330
                 shared_periodic_expire+0x25
                 periodic_execute+0xf5
                 taskq_thread+0x2a6
                 thread_start+0xb

fffffcfa361bc0a0 STOPPED  <NONE>                  1
                 swtch+0x139
                 stop+0x445
                 issig_forreal+0x227
                 issig+0x15
                 door_upcall+0x3df
                 door_ki_upcall_limited+0x74
                 i_dls_mgmt_upcall+0xbd
                 dls_mgmt_get_linkinfo+0x79
                 dls_devnet_set+0xf3
                 dls_devnet_create+0x44
                 create_xde+0x1abc
                 xde_ioc_opte_cmd+0x99a2
                 xde_ioctl+0x7e
                 cdev_ioctl+0x3f
                 spec_ioctl+0x55
                 fop_ioctl+0x40
                 ioctl+0x144
> ::pgrep sled-agent | ::walk thread | ::stacks
THREAD           STATE    SOBJ                COUNT
fffffcfa361c2060 SLEEP    CV                      5
                 swtch+0x139
                 cv_wait_sig_swap_core+0x160
                 cv_wait_sig_swap+0x17
                 cv_waituntil_sig+0xcd
                 lwp_park+0x13f
                 syslwp_park+0x85

fffffcfa34ad3c40 STOPPED  <NONE>                124
                 swtch+0x139
                 stop+0x445
                 issig_forreal+0x227
                 issig+0x15
                 cv_wait_sig_swap_core+0x277
                 cv_wait_sig_swap+0x17
                 cv_waituntil_sig+0xcd
                 lwp_park+0x13f
                 syslwp_park+0x85

fffffcfc174520c0 STOPPED  <NONE>                  2
                 swtch+0x139
                 stop+0x445
                 issig_forreal+0x227
                 issig+0x15
                 cv_timedwait_sig_hires+0x37f
                 cv_waituntil_sig+0x105
                 lwp_park+0x13f
                 syslwp_park+0x85

fffffcfa361bf080 SLEEP    CV                      1
                 swtch+0x139
                 cv_wait_sig+0x154
                 lwp_suspend+0x85
                 syslwp_suspend+0x44

fffffcfa361ceb60 SLEEP    RWLOCK                  1
                 swtch+0x139
                 turnstile_block+0x279
                 rw_enter_sleep+0x1de
                 create_xde+0x63
                 xde_ioc_opte_cmd+0x99a2
                 xde_ioctl+0x7e
                 cdev_ioctl+0x3f
                 spec_ioctl+0x55
                 fop_ioctl+0x40
                 ioctl+0x144

fffffcfa346d87e0 STOPPED  <NONE>                  1
                 swtch+0x139
                 stop+0x445
                 issig_forreal+0x227
                 issig+0x15
                 cv_wait_sig+0x28f
                 cte_get_event+0x93
                 ctfs_endpoint_ioctl+0xe1
                 ctfs_bu_ioctl+0x4e
                 fop_ioctl+0x40
                 ioctl+0x144

fffffcfa361bc0a0 STOPPED  <NONE>                  1
                 swtch+0x139
                 stop+0x445
                 issig_forreal+0x227
                 issig+0x15           
                 door_upcall+0x3df
                 door_ki_upcall_limited+0x74
                 i_dls_mgmt_upcall+0xbd
                 dls_mgmt_get_linkinfo+0x79
                 dls_devnet_set+0xf3
                 dls_devnet_create+0x44
                 create_xde+0x1abc
                 xde_ioc_opte_cmd+0x99a2
                 xde_ioctl+0x7e
                 cdev_ioctl+0x3f
                 spec_ioctl+0x55
                 fop_ioctl+0x40
                 ioctl+0x144

So there were a few things I noticed:

  • First, the thread fffff78809c80c20 is trying to execute periodic-related tasks. This seemed suspicious to me, but I will admit I didn't fully link it back at first reading.
  • With some additional digging, we can see that we can't remove the cyclic because of locally pending entries.
  • You'll see that all of sled-agent is stopped or trying to stop. It took me a while to spot the thread calling lwp_suspend here. I also noticed that everyone's stop code is 7, 0, most likely (PR_SUSPENDED, SUSPEND_NORMAL), which basically means the stop was requested by one of several code paths (lwp_suspend() from userland, holdlwps(), and so on).

Let's dig into this part of the deadlock. First, the following thread owns the xde rwlock:

> fffffcfa361bc0a0::findstack -v
stack pointer for thread fffffcfa361bc0a0 (sled-agent/126 [tokio-runtime-worker]): fffff788104338a0
[ fffff788104338a0 _resume_from_idle+0x12b() ]
  fffff788104338d0 swtch+0x139()
  fffff78810433940 stop+0x445(7, 0)
  fffff788104339e0 issig_forreal+0x227()
  fffff78810433a00 issig+0x15(0)
  fffff78810433aa0 door_upcall+0x3df(fffffcfa31468d80, fffff78810433b48, fffffcf93257edb0, ffffffffffffffff, 0)
  fffff78810433b20 door_ki_upcall_limited+0x74(fffffcfa2f4fedc8, fffff78810433b48, fffffcf93257edb0, ffffffffffffffff, 0)
  fffff78810433bd0 i_dls_mgmt_upcall+0xbd(fffff78810433be0, 8, fffff78810433be8, 30)
  fffff78810433c80 dls_mgmt_get_linkinfo+0x79(21, fffff78810433cb8, fffff78810433cac, 0, 0)
  fffff78810433d30 dls_devnet_set+0xf3(fffffd006bc8c000, 21, 0, fffff78810433d58)
  fffff78810433da0 dls_devnet_create+0x44(fffffd006bc8c000, 21, 0)
  fffff78810434040 create_xde+0x1abc()
  fffff78810434c90 xde_ioc_opte_cmd+0x99a2()
  fffff78810434cc0 xde_ioctl+0x7e()
  fffff78810434d00 cdev_ioctl+0x3f(b200020000, de777701, fffff5ffcedfc0a0, 202003, fffffcfa3cacb8d0, fffff78810434e18)
  fffff78810434d50 spec_ioctl+0x55(fffffcfaa882da80, de777701, fffff5ffcedfc0a0, 202003, fffffcfa3cacb8d0, fffff78810434e18, 0)
  fffff78810434de0 fop_ioctl+0x40(fffffcfaa882da80, de777701, fffff5ffcedfc0a0, 202003, fffffcfa3cacb8d0, fffff78810434e18, 0)
  fffff78810434f00 ioctl+0x144(49, de777701, fffff5ffcedfc0a0)
  fffff78810434f10 sys_syscall+0x17d()

It is stopped; based on the evidence of the lwp_suspend, the stop was requested from userland, and practically speaking that is going to be part of a fork (libc suspends all other threads before forking). The guess for fork is that sled-agent already has a child zoneadm process it's trying to clean up, while several other threads are simultaneously in create_xde calls.

So, if this fork theory is true, the thread that needs to make progress is the one calling lwp_suspend. Let's see who that is and whom they're trying to target:

> fffffcfa361bf080::findstack -v
stack pointer for thread fffffcfa361bf080 (sled-agent/125 [tokio-runtime-worker]): fffff7881042ede0
[ fffff7881042ede0 _resume_from_idle+0x12b() ]
  fffff7881042ee10 swtch+0x139()
  fffff7881042ee80 cv_wait_sig+0x154(fffffcfa3529e0e6, fffffcf932634700)
  fffff7881042eec0 lwp_suspend+0x85(fffffcfa361ceb60)
  fffff7881042ef00 syslwp_suspend+0x44(77)
  fffff7881042ef10 sys_syscall+0x17d()
> fffffcfa361ceb60::findstack -v
stack pointer for thread fffffcfa361ceb60 (sled-agent/119 [tokio-runtime-worker]): fffff7880b6f3c60
[ fffff7880b6f3c60 _resume_from_idle+0x12b() ]
  fffff7880b6f3c90 swtch+0x139()
  fffff7880b6f3d30 turnstile_block+0x279(0, 0, ffffffffc00f7890, fffffffffc0194a0, 0, 0)
  fffff7880b6f3da0 rw_enter_sleep+0x1de(ffffffffc00f7890, 0)
  fffff7880b6f4040 create_xde+0x63()
  fffff7880b6f4c90 xde_ioc_opte_cmd+0x99a2()
  fffff7880b6f4cc0 xde_ioctl+0x7e()
  fffff7880b6f4d00 cdev_ioctl+0x3f(b200020000, de777701, fffff5ffd09fc0a0, 202003, fffffcfa3cacb8d0, fffff7880b6f4e18)
  fffff7880b6f4d50 spec_ioctl+0x55(fffffcfaa882da80, de777701, fffff5ffd09fc0a0, 202003, fffffcfa3cacb8d0, fffff7880b6f4e18, 0)
  fffff7880b6f4de0 fop_ioctl+0x40(fffffcfaa882da80, de777701, fffff5ffd09fc0a0, 202003, fffffcfa3cacb8d0, fffff7880b6f4e18, 0)
  fffff7880b6f4f00 ioctl+0x144(46, de777701, fffff5ffd09fc0a0)
  fffff7880b6f4f10 sys_syscall+0x17d()

OK, based on this we can see how the deadlock has formed. We cannot suspend a thread because it's stuck waiting on an rwlock; a thread blocked in turnstile_block() does not check for signals or stop requests, so it can never honor the suspension. The write owner of that rwlock is our door upcall thread, and the upcall is a point at which it is allowed to stop, which is exactly what it has done. Because we have hit this deadlock, we cannot execute the periodic. I'll admit it's not 100% clear to me yet how that leads to the backup in cyclic juggling, but it roughly follows that if some cyclics are unable to execute, perhaps that all gets backed up.
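To make the shape of the deadlock concrete, here is a minimal userland analogue (illustration only; every name here is hypothetical, and pthreads stand in for the kernel's rwlock, the door upcall's stop point, and the fork-time suspend-everyone loop). The holder parks while owning the lock, the waiter can never park because it's blocked on that lock, and the suspender therefore waits forever. This program hangs by design:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t xde_lock = PTHREAD_MUTEX_INITIALIZER;	/* the xde rwlock */
static pthread_mutex_t stop_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t stop_cv = PTHREAD_COND_INITIALIZER;
static int stop_requested;	/* "fork has asked every thread to stop" */
static int stopped_count;	/* threads parked at a stop point */

/* A stop point, analogous to the issig()/stop() check in door_upcall(). */
static void
stop_point(void)
{
	pthread_mutex_lock(&stop_mtx);
	if (stop_requested) {
		stopped_count++;
		pthread_cond_broadcast(&stop_cv);
		while (stop_requested)	/* park, like a stopped LWP */
			pthread_cond_wait(&stop_cv, &stop_mtx);
	}
	pthread_mutex_unlock(&stop_mtx);
}

/* Analogue of sled-agent/126: upcall (and stop) while holding the lock. */
static void *
holder(void *arg)
{
	pthread_mutex_lock(&xde_lock);
	sleep(2);		/* "do the upcall"; the stop request arrives */
	stop_point();		/* parks forever, still holding xde_lock */
	pthread_mutex_unlock(&xde_lock);
	return (NULL);
}

/* Analogue of sled-agent/119: blocked on the lock, so it can never stop. */
static void *
waiter(void *arg)
{
	pthread_mutex_lock(&xde_lock);
	pthread_mutex_unlock(&xde_lock);
	return (NULL);
}

int
main(void)
{
	pthread_t h, w;

	pthread_create(&h, NULL, holder, NULL);
	sleep(1);		/* ensure holder owns xde_lock first */
	pthread_create(&w, NULL, waiter, NULL);

	/* Analogue of the fork-time suspension: stop everyone and wait. */
	pthread_mutex_lock(&stop_mtx);
	stop_requested = 1;
	while (stopped_count < 2)	/* never happens: waiter can't stop */
		pthread_cond_wait(&stop_cv, &stop_mtx);
	pthread_mutex_unlock(&stop_mtx);

	(void) printf("unreachable\n");
	return (0);
}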

I believe the fundamental issue here is that we should not be holding locks across upcalls. While I don't know all of the moving pieces here, logically speaking we probably need to change the creation/destruction path to use a combination of data protected by a mutex and a CV, where the mutex protects a pointer to the active kthread_t. A thread is granted ownership when it can install itself there, and everyone else waits on the CV. This likely doesn't solve the other non-creation/destruction paths, so that's something for us to discuss.
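As a rough illustration of that idea, here is a minimal sketch, assuming hypothetical names (xde_mgmt_lock, xde_mgmt_cv, xde_mgmt_owner) and standard illumos synchronization KPIs. The key properties are that no lock is held at the point where an upcall (and therefore a stop) can happen, and that waiters block in cv_wait_sig() rather than on a lock, so they can still be stopped for a fork:

#include <sys/types.h>
#include <sys/mutex.h>
#include <sys/condvar.h>
#include <sys/thread.h>
#include <sys/errno.h>
#include <sys/debug.h>

static kmutex_t xde_mgmt_lock;
static kcondvar_t xde_mgmt_cv;
static kthread_t *xde_mgmt_owner;	/* thread that owns create/destroy */

/* Take ownership of the creation/destruction path. */
static int
xde_mgmt_enter(void)
{
	mutex_enter(&xde_mgmt_lock);
	while (xde_mgmt_owner != NULL) {
		/*
		 * Unlike a thread blocked in rw_enter(), a thread in
		 * cv_wait_sig() can take a signal and stop, so a
		 * fork-time suspension of all threads can complete.
		 */
		if (cv_wait_sig(&xde_mgmt_cv, &xde_mgmt_lock) == 0) {
			mutex_exit(&xde_mgmt_lock);
			return (EINTR);
		}
	}
	xde_mgmt_owner = curthread;
	mutex_exit(&xde_mgmt_lock);
	/* No locks are held here, so upcalls (and stops) are safe. */
	return (0);
}

/* Release ownership and wake any waiters. */
static void
xde_mgmt_exit(void)
{
	mutex_enter(&xde_mgmt_lock);
	ASSERT3P(xde_mgmt_owner, ==, curthread);
	xde_mgmt_owner = NULL;
	cv_broadcast(&xde_mgmt_cv);
	mutex_exit(&xde_mgmt_lock);
}

In this scheme, create_xde and delete_xde would call xde_mgmt_enter() before doing any work (including the dls_devnet_create() upcall) and xde_mgmt_exit() on the way out, so the upcall never happens with a lock held.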
