Skip to content

Commit 8fccebf

Browse files
fdmananakdave
authored andcommitted
btrfs: fix metadata reservation for fallocate that leads to transaction aborts
When doing an fallocate(), specially a zero range operation, we assume that reserving 3 units of metadata space is enough, that at most we touch one leaf in subvolume/fs tree for removing existing file extent items and inserting a new file extent item. This assumption is generally true for most common use cases. However when we end up needing to remove file extent items from multiple leaves, we can end up failing with -ENOSPC and abort the current transaction, turning the filesystem to RO mode. When this happens a stack trace like the following is dumped in dmesg/syslog: [ 1500.620934] ------------[ cut here ]------------ [ 1500.620938] BTRFS: Transaction aborted (error -28) [ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs] [ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...) [ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1 [ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs] [ 1500.621026] Code: 8b 40 50 f0 48 (...) [ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286 [ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000 [ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff [ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001 [ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000 [ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60 [ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000 [ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0 [ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1500.621049] Call Trace: [ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs] [ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs] [ 1500.621108] vfs_fallocate+0x14d/0x290 [ 1500.621112] ksys_fallocate+0x3a/0x70 [ 1500.621117] __x64_sys_fallocate+0x1a/0x20 [ 1500.621120] do_syscall_64+0x33/0x80 [ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1500.621126] RIP: 0033:0x7fb5b248c477 [ 1500.621128] Code: 89 7c 24 08 (...) [ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d [ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477 [ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003 [ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000 [ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010 [ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003 [ 1500.621151] irq event stamp: 1026217 [ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0 [ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0 [ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606 [ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20 [ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]--- [ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left When we use fallocate() internally, for reserving an extent for a space cache, inode cache or relocation, we can't hit this problem since either there aren't any file extent items to remove from the subvolume tree or there is at most one. When using plain fallocate() it's very unlikely, since that would require having many file extent items representing holes for the target range and crossing multiple leafs - we attempt to increase the range (merge) of such file extent items when punching holes, so at most we end up with 2 file extent items for holes at leaf boundaries. However when using the zero range operation of fallocate() for a large range (100+ MiB for example) that's fairly easy to trigger. The following example reproducer triggers the issue: $ cat reproducer.sh #!/bin/bash umount /dev/sdj &> /dev/null mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null mount /dev/sdj /mnt/sdj # Create a 100M file with many file extent items. Punch a hole every 8K # just to speedup the file creation - we could do 4K sequential writes # followed by fsync (or O_SYNC) as well, but that takes a lot of time. file_size=$((100 * 1024 * 1024)) xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar for ((i = 0; i < $file_size; i += 8192)); do xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar done # Force a transaction commit, so the zero range operation will be forced # to COW all metadata extents it need to touch. sync xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar umount /mnt/sdj $ ./reproducer.sh wrote 104857600/104857600 bytes at offset 0 100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec) fallocate: No space left on device $ dmesg <shows the same stack trace pasted before> To fix this use the existing infrastructure that hole punching and extent cloning use for replacing a file range with another extent. This deals with doing the removal of file extent items and inserting the new one using an incremental approach, reserving more space when needed and always ensuring we don't leave an implicit hole in the range in case we need to do multiple iterations and a crash happens between iterations. A test case for fstests will follow up soon. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
1 parent a31a587 commit 8fccebf

File tree

4 files changed

+91
-32
lines changed

4 files changed

+91
-32
lines changed

fs/btrfs/ctree.h

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1202,6 +1202,22 @@ struct btrfs_clone_extent_info {
12021202
u64 file_offset;
12031203
char *extent_buf;
12041204
u32 item_size;
1205+
/*
1206+
* Set to true when attempting to replace a file range with a new extent
1207+
* described by this structure, set to false when attempting to clone an
1208+
* existing extent into a file range.
1209+
*/
1210+
bool is_new_extent;
1211+
/* Meaningful only if is_new_extent is true. */
1212+
int qgroup_reserved;
1213+
/*
1214+
* Meaningful only if is_new_extent is true.
1215+
* Used to track how many extent items we have already inserted in a
1216+
* subvolume tree that refer to the extent described by this structure,
1217+
* so that we know when to create a new delayed ref or update an existing
1218+
* one.
1219+
*/
1220+
int insertions;
12051221
};
12061222

12071223
struct btrfs_file_private {

fs/btrfs/file.c

Lines changed: 28 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -2583,7 +2583,6 @@ static int btrfs_insert_clone_extent(struct btrfs_trans_handle *trans,
25832583
struct btrfs_key key;
25842584
int slot;
25852585
struct btrfs_ref ref = { 0 };
2586-
u64 ref_offset;
25872586
int ret;
25882587

25892588
if (clone_len == 0)
@@ -2608,6 +2607,8 @@ static int btrfs_insert_clone_extent(struct btrfs_trans_handle *trans,
26082607
extent = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
26092608
btrfs_set_file_extent_offset(leaf, extent, clone_info->data_offset);
26102609
btrfs_set_file_extent_num_bytes(leaf, extent, clone_len);
2610+
if (clone_info->is_new_extent)
2611+
btrfs_set_file_extent_generation(leaf, extent, trans->transid);
26112612
btrfs_mark_buffer_dirty(leaf);
26122613
btrfs_release_path(path);
26132614

@@ -2621,13 +2622,29 @@ static int btrfs_insert_clone_extent(struct btrfs_trans_handle *trans,
26212622
return 0;
26222623

26232624
inode_add_bytes(inode, clone_len);
2624-
btrfs_init_generic_ref(&ref, BTRFS_ADD_DELAYED_REF,
2625-
clone_info->disk_offset,
2626-
clone_info->disk_len, 0);
2627-
ref_offset = clone_info->file_offset - clone_info->data_offset;
2628-
btrfs_init_data_ref(&ref, root->root_key.objectid,
2629-
btrfs_ino(BTRFS_I(inode)), ref_offset);
2630-
ret = btrfs_inc_extent_ref(trans, &ref);
2625+
2626+
if (clone_info->is_new_extent && clone_info->insertions == 0) {
2627+
key.objectid = clone_info->disk_offset;
2628+
key.type = BTRFS_EXTENT_ITEM_KEY;
2629+
key.offset = clone_info->disk_len;
2630+
ret = btrfs_alloc_reserved_file_extent(trans, root,
2631+
btrfs_ino(BTRFS_I(inode)),
2632+
clone_info->file_offset,
2633+
clone_info->qgroup_reserved,
2634+
&key);
2635+
} else {
2636+
u64 ref_offset;
2637+
2638+
btrfs_init_generic_ref(&ref, BTRFS_ADD_DELAYED_REF,
2639+
clone_info->disk_offset,
2640+
clone_info->disk_len, 0);
2641+
ref_offset = clone_info->file_offset - clone_info->data_offset;
2642+
btrfs_init_data_ref(&ref, root->root_key.objectid,
2643+
btrfs_ino(BTRFS_I(inode)), ref_offset);
2644+
ret = btrfs_inc_extent_ref(trans, &ref);
2645+
}
2646+
2647+
clone_info->insertions++;
26312648

26322649
return ret;
26332650
}
@@ -2705,7 +2722,8 @@ int btrfs_punch_hole_range(struct inode *inode, struct btrfs_path *path,
27052722
* returned by __btrfs_drop_extents() without having
27062723
* changed anything in the file.
27072724
*/
2708-
if (clone_info && ret && ret != -EOPNOTSUPP)
2725+
if (clone_info && !clone_info->is_new_extent &&
2726+
ret && ret != -EOPNOTSUPP)
27092727
btrfs_abort_transaction(trans, ret);
27102728
break;
27112729
}
@@ -2800,7 +2818,7 @@ int btrfs_punch_hole_range(struct inode *inode, struct btrfs_path *path,
28002818
* than 16Mb would force the full fsync any way (when
28012819
* try_release_extent_mapping() is invoked during page cache truncation.
28022820
*/
2803-
if (clone_info)
2821+
if (clone_info && !clone_info->is_new_extent)
28042822
set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
28052823
&BTRFS_I(inode)->runtime_flags);
28062824

fs/btrfs/inode.c

Lines changed: 46 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -9576,11 +9576,15 @@ static int btrfs_symlink(struct inode *dir, struct dentry *dentry,
95769576
return err;
95779577
}
95789578

9579-
static int insert_prealloc_file_extent(struct btrfs_trans_handle *trans,
9579+
static struct btrfs_trans_handle *insert_prealloc_file_extent(
9580+
struct btrfs_trans_handle *trans_in,
95809581
struct inode *inode, struct btrfs_key *ins,
95819582
u64 file_offset)
95829583
{
95839584
struct btrfs_file_extent_item stack_fi;
9585+
struct btrfs_clone_extent_info extent_info;
9586+
struct btrfs_trans_handle *trans = trans_in;
9587+
struct btrfs_path *path;
95849588
u64 start = ins->objectid;
95859589
u64 len = ins->offset;
95869590
int ret;
@@ -9597,10 +9601,41 @@ static int insert_prealloc_file_extent(struct btrfs_trans_handle *trans,
95979601

95989602
ret = btrfs_qgroup_release_data(BTRFS_I(inode), file_offset, len);
95999603
if (ret < 0)
9600-
return ret;
9601-
return insert_reserved_file_extent(trans, BTRFS_I(inode), file_offset,
9602-
&stack_fi, ret);
9604+
return ERR_PTR(ret);
9605+
9606+
if (trans) {
9607+
ret = insert_reserved_file_extent(trans, BTRFS_I(inode),
9608+
file_offset, &stack_fi, ret);
9609+
if (ret)
9610+
return ERR_PTR(ret);
9611+
return trans;
9612+
}
9613+
9614+
extent_info.disk_offset = start;
9615+
extent_info.disk_len = len;
9616+
extent_info.data_offset = 0;
9617+
extent_info.data_len = len;
9618+
extent_info.file_offset = file_offset;
9619+
extent_info.extent_buf = (char *)&stack_fi;
9620+
extent_info.item_size = sizeof(stack_fi);
9621+
extent_info.is_new_extent = true;
9622+
extent_info.qgroup_reserved = ret;
9623+
extent_info.insertions = 0;
9624+
9625+
path = btrfs_alloc_path();
9626+
if (!path)
9627+
return ERR_PTR(-ENOMEM);
9628+
9629+
ret = btrfs_punch_hole_range(inode, path, file_offset,
9630+
file_offset + len - 1, &extent_info,
9631+
&trans);
9632+
btrfs_free_path(path);
9633+
if (ret)
9634+
return ERR_PTR(ret);
9635+
9636+
return trans;
96039637
}
9638+
96049639
static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
96059640
u64 start, u64 num_bytes, u64 min_size,
96069641
loff_t actual_len, u64 *alloc_hint,
@@ -9623,14 +9658,6 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
96239658
if (trans)
96249659
own_trans = false;
96259660
while (num_bytes > 0) {
9626-
if (own_trans) {
9627-
trans = btrfs_start_transaction(root, 3);
9628-
if (IS_ERR(trans)) {
9629-
ret = PTR_ERR(trans);
9630-
break;
9631-
}
9632-
}
9633-
96349661
cur_bytes = min_t(u64, num_bytes, SZ_256M);
96359662
cur_bytes = max(cur_bytes, min_size);
96369663
/*
@@ -9642,11 +9669,8 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
96429669
cur_bytes = min(cur_bytes, last_alloc);
96439670
ret = btrfs_reserve_extent(root, cur_bytes, cur_bytes,
96449671
min_size, 0, *alloc_hint, &ins, 1, 0);
9645-
if (ret) {
9646-
if (own_trans)
9647-
btrfs_end_transaction(trans);
9672+
if (ret)
96489673
break;
9649-
}
96509674

96519675
/*
96529676
* We've reserved this space, and thus converted it from
@@ -9659,13 +9683,11 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
96599683
btrfs_dec_block_group_reservations(fs_info, ins.objectid);
96609684

96619685
last_alloc = ins.offset;
9662-
ret = insert_prealloc_file_extent(trans, inode, &ins, cur_offset);
9663-
if (ret) {
9686+
trans = insert_prealloc_file_extent(trans, inode, &ins, cur_offset);
9687+
if (IS_ERR(trans)) {
9688+
ret = PTR_ERR(trans);
96649689
btrfs_free_reserved_extent(fs_info, ins.objectid,
96659690
ins.offset, 0);
9666-
btrfs_abort_transaction(trans, ret);
9667-
if (own_trans)
9668-
btrfs_end_transaction(trans);
96699691
break;
96709692
}
96719693

@@ -9728,8 +9750,10 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
97289750
break;
97299751
}
97309752

9731-
if (own_trans)
9753+
if (own_trans) {
97329754
btrfs_end_transaction(trans);
9755+
trans = NULL;
9756+
}
97339757
}
97349758
if (clear_offset < end)
97359759
btrfs_free_reserved_data_space(BTRFS_I(inode), NULL, clear_offset,

fs/btrfs/reflink.c

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -463,6 +463,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
463463
clone_info.file_offset = new_key.offset;
464464
clone_info.extent_buf = buf;
465465
clone_info.item_size = size;
466+
clone_info.is_new_extent = false;
466467
ret = btrfs_punch_hole_range(inode, path, drop_start,
467468
new_key.offset + datal - 1, &clone_info,
468469
&trans);

0 commit comments

Comments
 (0)