
Commit 4f8126b

krisman authored and axboe committed
sbitmap: Use single per-bitmap counting to wake up queued tags
sbitmap suffers from code complexity, as demonstrated by recent fixes, and from occasional lost wake ups on nested I/O completion. The latter happens, from what I understand, due to the non-atomic nature of the updates to wait_cnt, which needs to be decremented and eventually reset when it reaches zero. This two-step process can miss an update when a nested completion interrupts the CPU in between the wait_cnt updates. This is very hard to fix, as shown by the recent changes to this code; the complexity arises mostly from the corner cases required to avoid missed wakes in this scenario. In addition, the handling of wake_batch recalculation plus the synchronization with sbitmap_queue_wake_up() is non-trivial.

This patchset implements the idea originally proposed by Jan [1], which removes the need for the two-step updates of wait_cnt. This is done by tracking the number of completions and wakeups in always-increasing, per-bitmap counters. Instead of having to reset wait_cnt when it reaches zero, we simply keep counting, and attempt to wake up N threads in a single wait queue whenever there is enough space for a batch. Waking up fewer than wake_batch threads shouldn't be a problem, because we haven't changed the conditions for wake up, and the existing batch calculation guarantees at least enough remaining completions to wake up a batch for each queue at any time.

Performance-wise, one should expect very similar performance to the original algorithm for the case where there is no queueing. In both the old algorithm and this implementation, the first step is to check ws_active, which bails out if there is no queueing to be managed. In the new code, we took care to avoid accounting completions and wakeups when there is no queueing, to avoid paying the cost of atomic operations unnecessarily, so they don't skew the numbers.

For the more interesting cases, where there is queueing, we need to take into account the cross-communication of the atomic operations. I've been benchmarking by running parallel fio jobs against a single-hctx nullb in different hardware queue depth scenarios, and verifying both IOPS and queueing.

Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel jobs. fio was issuing fixed-size randwrites with qd=64 against nullb, varying only the hardware queue length per test.

queue size   2                4                8                16                32                64
6.1-rc2      1681.1K (1.6K)   2633.0K (12.7K)  6940.8K (16.3K)  8172.3K (617.5K)  8391.7K (367.1K)  8606.1K (351.2K)
patched      1721.8K (15.1K)  3016.7K (3.8K)   7543.0K (89.4K)  8132.5K (303.4K)  8324.2K (230.6K)  8401.8K (284.7K)

The following is a similar experiment, run against a nullb with a single bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40 parallel fio jobs operating on the same device.

queue size   2                4                8                16                32                 64
6.1-rc2      1081.0K (2.3K)   957.2K (1.5K)    1699.1K (5.7K)   6178.2K (124.6K)  12227.9K (37.7K)   13286.6K (92.9K)
patched      1081.8K (2.8K)   1316.5K (5.4K)   2364.4K (1.8K)   6151.4K (20.0K)   11893.6K (17.5K)   12385.6K (18.4K)

It has also survived blktests and a 12h stress run against nullb. I also ran the code against nvme and a SCSI SSD, and I didn't observe any performance regression in those. If there are other tests you think I should run, please let me know and I will follow up with the results.
[1] https://lore.kernel.org/all/[email protected]/

Cc: Hugh Dickins <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Liu Song <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
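To illustrate the counting scheme described above, the following is a minimal userspace sketch (not the kernel code itself) of how two monotonically increasing counters replace the per-waitqueue wait_cnt: a completer adds to completion_cnt, and whoever observes at least wake_batch unclaimed completions claims a whole batch by advancing wakeup_cnt with a compare-and-swap. The names sbq_sketch and sketch_wake_up are illustrative only, C11 atomics stand in for the kernel's atomic_t helpers, and the wait queue itself is elided.

/*
 * Userspace sketch of the per-bitmap counting wakeup idea.
 * Hypothetical names; not the kernel implementation.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct sbq_sketch {
        unsigned int wake_batch;        /* completions needed per wakeup batch */
        atomic_uint completion_cnt;     /* bits cleared so far; only grows */
        atomic_uint wakeup_cnt;         /* wakeups issued so far; only grows */
};

/* Called on completion with the number of bits just cleared. */
static bool sketch_wake_up(struct sbq_sketch *sbq, unsigned int nr)
{
        unsigned int wake_batch = sbq->wake_batch;
        unsigned int wakeups;

        atomic_fetch_add(&sbq->completion_cnt, nr);
        wakeups = atomic_load(&sbq->wakeup_cnt);

        do {
                /* Not enough unclaimed completions for a full batch yet. */
                if (atomic_load(&sbq->completion_cnt) - wakeups < wake_batch)
                        return false;
                /*
                 * Claim one batch by advancing wakeup_cnt.  On failure the
                 * compare-exchange reloads 'wakeups' and the condition is
                 * re-checked, so concurrent claimers never double-count.
                 */
        } while (!atomic_compare_exchange_weak(&sbq->wakeup_cnt, &wakeups,
                                               wakeups + wake_batch));

        /* The kernel would call wake_up_nr(&ws->wait, wake_batch) here. */
        return true;
}

int main(void)
{
        struct sbq_sketch sbq = { .wake_batch = 4 };

        atomic_init(&sbq.completion_cnt, 0);
        atomic_init(&sbq.wakeup_cnt, 0);

        /* Every fourth single-bit completion claims one batch of wakeups. */
        for (int i = 1; i <= 8; i++)
                printf("completion %2d -> %s\n", i,
                       sketch_wake_up(&sbq, 1) ? "wake a batch" : "no wake");
        return 0;
}

Because both counters only ever grow, there is no reset step to race against; a nested completion that interrupts another can at worst make the batch available a little earlier, never lose it.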
1 parent ee9d552 commit 4f8126b

2 files changed: +37 -101 lines changed


include/linux/sbitmap.h

Lines changed: 11 additions & 5 deletions
@@ -86,11 +86,6 @@ struct sbitmap {
  * struct sbq_wait_state - Wait queue in a &struct sbitmap_queue.
  */
 struct sbq_wait_state {
-        /**
-         * @wait_cnt: Number of frees remaining before we wake up.
-         */
-        atomic_t wait_cnt;
-
         /**
          * @wait: Wait queue.
          */
@@ -138,6 +133,17 @@ struct sbitmap_queue {
          * sbitmap_queue_get_shallow()
          */
         unsigned int min_shallow_depth;
+
+        /**
+         * @completion_cnt: Number of bits cleared passed to the
+         * wakeup function.
+         */
+        atomic_t completion_cnt;
+
+        /**
+         * @wakeup_cnt: Number of thread wake ups issued.
+         */
+        atomic_t wakeup_cnt;
 };
 
 /**

lib/sbitmap.c

Lines changed: 26 additions & 96 deletions
@@ -434,47 +434,30 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
         sbq->wake_batch = sbq_calc_wake_batch(sbq, depth);
         atomic_set(&sbq->wake_index, 0);
         atomic_set(&sbq->ws_active, 0);
+        atomic_set(&sbq->completion_cnt, 0);
+        atomic_set(&sbq->wakeup_cnt, 0);
 
         sbq->ws = kzalloc_node(SBQ_WAIT_QUEUES * sizeof(*sbq->ws), flags, node);
         if (!sbq->ws) {
                 sbitmap_free(&sbq->sb);
                 return -ENOMEM;
         }
 
-        for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
+        for (i = 0; i < SBQ_WAIT_QUEUES; i++)
                 init_waitqueue_head(&sbq->ws[i].wait);
-                atomic_set(&sbq->ws[i].wait_cnt, sbq->wake_batch);
-        }
 
         return 0;
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_init_node);
 
-static inline void __sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq,
-                                                     unsigned int wake_batch)
-{
-        int i;
-
-        if (sbq->wake_batch != wake_batch) {
-                WRITE_ONCE(sbq->wake_batch, wake_batch);
-                /*
-                 * Pairs with the memory barrier in sbitmap_queue_wake_up()
-                 * to ensure that the batch size is updated before the wait
-                 * counts.
-                 */
-                smp_mb();
-                for (i = 0; i < SBQ_WAIT_QUEUES; i++)
-                        atomic_set(&sbq->ws[i].wait_cnt, 1);
-        }
-}
-
 static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq,
                                             unsigned int depth)
 {
         unsigned int wake_batch;
 
         wake_batch = sbq_calc_wake_batch(sbq, depth);
-        __sbitmap_queue_update_wake_batch(sbq, wake_batch);
+        if (sbq->wake_batch != wake_batch)
+                WRITE_ONCE(sbq->wake_batch, wake_batch);
 }
 
 void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
@@ -488,7 +471,8 @@ void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
 
         wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES,
                         min_batch, SBQ_WAKE_BATCH);
-        __sbitmap_queue_update_wake_batch(sbq, wake_batch);
+
+        WRITE_ONCE(sbq->wake_batch, wake_batch);
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_recalculate_wake_batch);
 
@@ -587,7 +571,7 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
         for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
                 struct sbq_wait_state *ws = &sbq->ws[wake_index];
 
-                if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) {
+                if (waitqueue_active(&ws->wait)) {
                         if (wake_index != atomic_read(&sbq->wake_index))
                                 atomic_set(&sbq->wake_index, wake_index);
                         return ws;
@@ -599,83 +583,31 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
         return NULL;
 }
 
-static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
+void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
 {
-        struct sbq_wait_state *ws;
-        unsigned int wake_batch;
-        int wait_cnt, cur, sub;
-        bool ret;
+        unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
+        struct sbq_wait_state *ws = NULL;
+        unsigned int wakeups;
 
-        if (*nr <= 0)
-                return false;
+        if (!atomic_read(&sbq->ws_active))
+                return;
 
-        ws = sbq_wake_ptr(sbq);
-        if (!ws)
-                return false;
+        atomic_add(nr, &sbq->completion_cnt);
+        wakeups = atomic_read(&sbq->wakeup_cnt);
 
-        cur = atomic_read(&ws->wait_cnt);
         do {
-                /*
-                 * For concurrent callers of this, callers should call this
-                 * function again to wakeup a new batch on a different 'ws'.
-                 */
-                if (cur == 0)
-                        return true;
-                sub = min(*nr, cur);
-                wait_cnt = cur - sub;
-        } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
-
-        /*
-         * If we decremented queue without waiters, retry to avoid lost
-         * wakeups.
-         */
-        if (wait_cnt > 0)
-                return !waitqueue_active(&ws->wait);
+                if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
+                        return;
 
-        *nr -= sub;
-
-        /*
-         * When wait_cnt == 0, we have to be particularly careful as we are
-         * responsible to reset wait_cnt regardless whether we've actually
-         * woken up anybody. But in case we didn't wakeup anybody, we still
-         * need to retry.
-         */
-        ret = !waitqueue_active(&ws->wait);
-        wake_batch = READ_ONCE(sbq->wake_batch);
+                if (!ws) {
+                        ws = sbq_wake_ptr(sbq);
+                        if (!ws)
+                                return;
+                }
+        } while (!atomic_try_cmpxchg(&sbq->wakeup_cnt,
+                                     &wakeups, wakeups + wake_batch));
 
-        /*
-         * Wake up first in case that concurrent callers decrease wait_cnt
-         * while waitqueue is empty.
-         */
         wake_up_nr(&ws->wait, wake_batch);
-
-        /*
-         * Pairs with the memory barrier in sbitmap_queue_resize() to
-         * ensure that we see the batch size update before the wait
-         * count is reset.
-         *
-         * Also pairs with the implicit barrier between decrementing wait_cnt
-         * and checking for waitqueue_active() to make sure waitqueue_active()
-         * sees result of the wakeup if atomic_dec_return() has seen the result
-         * of atomic_set().
-         */
-        smp_mb__before_atomic();
-
-        /*
-         * Increase wake_index before updating wait_cnt, otherwise concurrent
-         * callers can see valid wait_cnt in old waitqueue, which can cause
-         * invalid wakeup on the old waitqueue.
-         */
-        sbq_index_atomic_inc(&sbq->wake_index);
-        atomic_set(&ws->wait_cnt, wake_batch);
-
-        return ret || *nr;
-}
-
-void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
-{
-        while (__sbq_wake_up(sbq, &nr))
-                ;
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
 
@@ -792,9 +724,7 @@ void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m)
         seq_puts(m, "ws={\n");
         for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
                 struct sbq_wait_state *ws = &sbq->ws[i];
-
-                seq_printf(m, "\t{.wait_cnt=%d, .wait=%s},\n",
-                           atomic_read(&ws->wait_cnt),
+                seq_printf(m, "\t{.wait=%s},\n",
                            waitqueue_active(&ws->wait) ? "active" : "inactive");
         }
         seq_puts(m, "}\n");
