Skip to content

Commit 7a97311

Browse files
josefbacikgregkh
authored andcommitted
btrfs: do not do delalloc reservation under page lock
[ Upstream commit f4b1363 ] We ran into a deadlock in production with the fixup worker. The stack traces were as follows: Thread responsible for the writeout, waiting on the page lock [<0>] io_schedule+0x12/0x40 [<0>] __lock_page+0x109/0x1e0 [<0>] extent_write_cache_pages+0x206/0x360 [<0>] extent_writepages+0x40/0x60 [<0>] do_writepages+0x31/0xb0 [<0>] __writeback_single_inode+0x3d/0x350 [<0>] writeback_sb_inodes+0x19d/0x3c0 [<0>] __writeback_inodes_wb+0x5d/0xb0 [<0>] wb_writeback+0x231/0x2c0 [<0>] wb_workfn+0x308/0x3c0 [<0>] process_one_work+0x1e0/0x390 [<0>] worker_thread+0x2b/0x3c0 [<0>] kthread+0x113/0x130 [<0>] ret_from_fork+0x35/0x40 [<0>] 0xffffffffffffffff Thread of the fixup worker who is holding the page lock [<0>] start_delalloc_inodes+0x241/0x2d0 [<0>] btrfs_start_delalloc_roots+0x179/0x230 [<0>] btrfs_alloc_data_chunk_ondemand+0x11b/0x2e0 [<0>] btrfs_check_data_free_space+0x53/0xa0 [<0>] btrfs_delalloc_reserve_space+0x20/0x70 [<0>] btrfs_writepage_fixup_worker+0x1fc/0x2a0 [<0>] normal_work_helper+0x11c/0x360 [<0>] process_one_work+0x1e0/0x390 [<0>] worker_thread+0x2b/0x3c0 [<0>] kthread+0x113/0x130 [<0>] ret_from_fork+0x35/0x40 [<0>] 0xffffffffffffffff Thankfully the stars have to align just right to hit this. First you have to end up in the fixup worker, which is tricky by itself (my reproducer does DIO reads into a MMAP'ed region, so not a common operation). Then you have to have less than a page size of free data space and 0 unallocated space so you go down the "commit the transaction to free up pinned space" path. This was accomplished by a random balance that was running on the host. Then you get this deadlock. I'm still in the process of trying to force the deadlock to happen on demand, but I've hit other issues. I can still trigger the fixup worker path itself so this patch has been tested in that regard, so the normal case is fine. Fixes: 87826df ("btrfs: delalloc for page dirtied out-of-band in fixup worker") Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
1 parent a531e6b commit 7a97311

File tree

1 file changed

+60
-16
lines changed

1 file changed

+60
-16
lines changed

fs/btrfs/inode.c

Lines changed: 60 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -2168,6 +2168,7 @@ int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
21682168
/* see btrfs_writepage_start_hook for details on why this is required */
21692169
struct btrfs_writepage_fixup {
21702170
struct page *page;
2171+
struct inode *inode;
21712172
struct btrfs_work work;
21722173
};
21732174

@@ -2182,9 +2183,20 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
21822183
u64 page_start;
21832184
u64 page_end;
21842185
int ret = 0;
2186+
bool free_delalloc_space = true;
21852187

21862188
fixup = container_of(work, struct btrfs_writepage_fixup, work);
21872189
page = fixup->page;
2190+
inode = fixup->inode;
2191+
page_start = page_offset(page);
2192+
page_end = page_offset(page) + PAGE_SIZE - 1;
2193+
2194+
/*
2195+
* This is similar to page_mkwrite, we need to reserve the space before
2196+
* we take the page lock.
2197+
*/
2198+
ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
2199+
PAGE_SIZE);
21882200
again:
21892201
lock_page(page);
21902202

@@ -2193,25 +2205,48 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
21932205
* page->mapping may go NULL, but it shouldn't be moved to a different
21942206
* address space.
21952207
*/
2196-
if (!page->mapping || !PageDirty(page) || !PageChecked(page))
2208+
if (!page->mapping || !PageDirty(page) || !PageChecked(page)) {
2209+
/*
2210+
* Unfortunately this is a little tricky, either
2211+
*
2212+
* 1) We got here and our page had already been dealt with and
2213+
* we reserved our space, thus ret == 0, so we need to just
2214+
* drop our space reservation and bail. This can happen the
2215+
* first time we come into the fixup worker, or could happen
2216+
* while waiting for the ordered extent.
2217+
* 2) Our page was already dealt with, but we happened to get an
2218+
* ENOSPC above from the btrfs_delalloc_reserve_space. In
2219+
* this case we obviously don't have anything to release, but
2220+
* because the page was already dealt with we don't want to
2221+
* mark the page with an error, so make sure we're resetting
2222+
* ret to 0. This is why we have this check _before_ the ret
2223+
* check, because we do not want to have a surprise ENOSPC
2224+
* when the page was already properly dealt with.
2225+
*/
2226+
if (!ret) {
2227+
btrfs_delalloc_release_extents(BTRFS_I(inode),
2228+
PAGE_SIZE);
2229+
btrfs_delalloc_release_space(inode, data_reserved,
2230+
page_start, PAGE_SIZE,
2231+
true);
2232+
}
2233+
ret = 0;
21972234
goto out_page;
2235+
}
21982236

21992237
/*
2200-
* We keep the PageChecked() bit set until we're done with the
2201-
* btrfs_start_ordered_extent() dance that we do below. That drops and
2202-
* retakes the page lock, so we don't want new fixup workers queued for
2203-
* this page during the churn.
2238+
* We can't mess with the page state unless it is locked, so now that
2239+
* it is locked bail if we failed to make our space reservation.
22042240
*/
2205-
inode = page->mapping->host;
2206-
page_start = page_offset(page);
2207-
page_end = page_offset(page) + PAGE_SIZE - 1;
2241+
if (ret)
2242+
goto out_page;
22082243

22092244
lock_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
22102245
&cached_state);
22112246

22122247
/* already ordered? We're done */
22132248
if (PagePrivate2(page))
2214-
goto out;
2249+
goto out_reserved;
22152250

22162251
ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), page_start,
22172252
PAGE_SIZE);
@@ -2224,11 +2259,6 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
22242259
goto again;
22252260
}
22262261

2227-
ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
2228-
PAGE_SIZE);
2229-
if (ret)
2230-
goto out;
2231-
22322262
ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
22332263
&cached_state);
22342264
if (ret)
@@ -2242,12 +2272,12 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
22422272
* The page was dirty when we started, nothing should have cleaned it.
22432273
*/
22442274
BUG_ON(!PageDirty(page));
2275+
free_delalloc_space = false;
22452276
out_reserved:
22462277
btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
2247-
if (ret)
2278+
if (free_delalloc_space)
22482279
btrfs_delalloc_release_space(inode, data_reserved, page_start,
22492280
PAGE_SIZE, true);
2250-
out:
22512281
unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start, page_end,
22522282
&cached_state);
22532283
out_page:
@@ -2266,6 +2296,12 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
22662296
put_page(page);
22672297
kfree(fixup);
22682298
extent_changeset_free(data_reserved);
2299+
/*
2300+
* As a precaution, do a delayed iput in case it would be the last iput
2301+
* that could need flushing space. Recursing back to fixup worker would
2302+
* deadlock.
2303+
*/
2304+
btrfs_add_delayed_iput(inode);
22692305
}
22702306

22712307
/*
@@ -2303,10 +2339,18 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
23032339
if (!fixup)
23042340
return -EAGAIN;
23052341

2342+
/*
2343+
* We are already holding a reference to this inode from
2344+
* write_cache_pages. We need to hold it because the space reservation
2345+
* takes place outside of the page lock, and we can't trust
2346+
* page->mapping outside of the page lock.
2347+
*/
2348+
ihold(inode);
23062349
SetPageChecked(page);
23072350
get_page(page);
23082351
btrfs_init_work(&fixup->work, btrfs_writepage_fixup_worker, NULL, NULL);
23092352
fixup->page = page;
2353+
fixup->inode = inode;
23102354
btrfs_queue_work(fs_info->fixup_workers, &fixup->work);
23112355

23122356
return -EAGAIN;

0 commit comments

Comments
 (0)