Skip to content

Commit 8e4f21f

Browse files
adam900710kdave
authored andcommitted
btrfs: handle unaligned EOF truncation correctly for subpage cases
[BUG] The following fsx sequence will fail on btrfs with 64K page size and 4K fs block size: #fsx -d -e 1 -N 4 $mnt/junk -S 36386 READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk OFFSET GOOD BAD RANGE 0xe9ba 0x0000 0x03ac 0x0 operation# (mod 256) for the bad data may be 3 ... LOG DUMP (4 total operations): 1( 1 mod 256): WRITE 0x6c62 thru 0x1147d (0xa81c bytes) HOLE ***WWWW 2( 2 mod 256): TRUNCATE DOWN from 0x1147e to 0x5448 ******WWWW 3( 3 mod 256): ZERO 0x1c7aa thru 0x28fe2 (0xc839 bytes) 4( 4 mod 256): MAPREAD 0xe9ba thru 0x1578e (0x6dd5 bytes) ***RRRR*** [CAUSE] Only 2 operations are really involved in this case: 3 pollute_eof 0x5448 thru 0xffff (0xabb8 bytes) 3 zero from 0x1c7aa to 0x28fe3, (0xc839 bytes) 4 mapread 0xe9ba thru 0x1578e (0x6dd5 bytes) At operation 3, fsx pollutes beyond EOF, that is done by mmap() and write into that mmap() range beyond EOF. Such write will fill the range beyond EOF, but it will never reach disk as ranges beyond EOF will not be marked dirty nor uptodate. Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond our isize (which was 0x5448), we should zero out any range beyond EOF (0x5448). During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the unaligned head block. But that function only really zeroes out the block at [0x5000, 0x5fff], it doesn't bother any range other that that block, since those ranges will not be marked dirty nor written back. So the range [0x6000, 0xffff] is still polluted, and later mapread() will return the poisoned value. [FIX] Enhance btrfs_truncate_block() by: - Pass a @start/@EnD pair to indicate the full truncation range This is to handle the following truncation case: Page size is 64K, fs block size is 4K, truncate range is [6K, 60K] 0 32K 64K | |///////////////////////////////////| | 6K 60K The range is not aligned for its head block, so we need to call btrfs_truncate_block() with @from = 6K, @front = 0, @len = 0. But with that information we only know to zero the range [6K, 8K), if we zero out the range [6K, 64K), the last block will also be zeroed, causing data loss. So here we need the full range we're truncating, so that we can avoid over-truncation. - Rename @from to @offset As now the parameter is only utilized to locate a block, it's not really carrying the old @from meaning well. - Remove @front parameter With the full truncate range passed in, we can determine if the @offset is at the head or tail block. - Skip truncation if @offset is not in the head nor tail blocks The call site in hole punch unconditionally call btrfs_truncate_block() without even checking the range is aligned or not. If the @offset is neither in the head nor in tail block, it means we can safely ignore it. - Skip truncate if the range inside the target block is already aligned - Make btrfs_truncate_block() zero all blocks beyond EOF Since we have the original range, we know exactly if we're doing truncation beyond EOF (the @EnD will be (u64)-1). If we're doing truncation beyond EOF, then enlarge the truncation range to the folio end, to address the possibly polluted ranges. Otherwise still keep the zero range inside the block, as we can have large data folios soon, always truncating every blocks inside the same folio can be costly for large folios. Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
1 parent 3649833 commit 8e4f21f

File tree

3 files changed

+98
-45
lines changed

3 files changed

+98
-45
lines changed

fs/btrfs/btrfs_inode.h

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -547,8 +547,7 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
547547
struct btrfs_inode *parent_inode, struct btrfs_inode *inode,
548548
const struct fscrypt_str *name, int add_backref, u64 index);
549549
int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry);
550-
int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
551-
int front);
550+
int btrfs_truncate_block(struct btrfs_inode *inode, u64 offset, u64 start, u64 end);
552551

553552
int btrfs_start_delalloc_snapshot(struct btrfs_root *root, bool in_reclaim_context);
554553
int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,

fs/btrfs/file.c

Lines changed: 17 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -2611,7 +2611,8 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
26112611
u64 lockend;
26122612
u64 tail_start;
26132613
u64 tail_len;
2614-
u64 orig_start = offset;
2614+
const u64 orig_start = offset;
2615+
const u64 orig_end = offset + len - 1;
26152616
int ret = 0;
26162617
bool same_block;
26172618
u64 ino_size;
@@ -2642,19 +2643,15 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
26422643
lockend = round_down(offset + len, fs_info->sectorsize) - 1;
26432644
same_block = (BTRFS_BYTES_TO_BLKS(fs_info, offset))
26442645
== (BTRFS_BYTES_TO_BLKS(fs_info, offset + len - 1));
2645-
/*
2646-
* We needn't truncate any block which is beyond the end of the file
2647-
* because we are sure there is no data there.
2648-
*/
26492646
/*
26502647
* Only do this if we are in the same block and we aren't doing the
26512648
* entire block.
26522649
*/
26532650
if (same_block && len < fs_info->sectorsize) {
26542651
if (offset < ino_size) {
26552652
truncated_block = true;
2656-
ret = btrfs_truncate_block(BTRFS_I(inode), offset, len,
2657-
0);
2653+
ret = btrfs_truncate_block(BTRFS_I(inode), offset + len - 1,
2654+
orig_start, orig_end);
26582655
} else {
26592656
ret = 0;
26602657
}
@@ -2664,7 +2661,7 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
26642661
/* zero back part of the first block */
26652662
if (offset < ino_size) {
26662663
truncated_block = true;
2667-
ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
2664+
ret = btrfs_truncate_block(BTRFS_I(inode), offset, orig_start, orig_end);
26682665
if (ret) {
26692666
btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
26702667
return ret;
@@ -2701,8 +2698,8 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
27012698
if (tail_start + tail_len < ino_size) {
27022699
truncated_block = true;
27032700
ret = btrfs_truncate_block(BTRFS_I(inode),
2704-
tail_start + tail_len,
2705-
0, 1);
2701+
tail_start + tail_len - 1,
2702+
orig_start, orig_end);
27062703
if (ret)
27072704
goto out_only_mutex;
27082705
}
@@ -2870,6 +2867,8 @@ static int btrfs_zero_range(struct inode *inode,
28702867
int ret;
28712868
u64 alloc_hint = 0;
28722869
const u64 sectorsize = fs_info->sectorsize;
2870+
const u64 orig_start = offset;
2871+
const u64 orig_end = offset + len - 1;
28732872
u64 alloc_start = round_down(offset, sectorsize);
28742873
u64 alloc_end = round_up(offset + len, sectorsize);
28752874
u64 bytes_to_reserve = 0;
@@ -2932,8 +2931,8 @@ static int btrfs_zero_range(struct inode *inode,
29322931
}
29332932
if (len < sectorsize && em->disk_bytenr != EXTENT_MAP_HOLE) {
29342933
btrfs_free_extent_map(em);
2935-
ret = btrfs_truncate_block(BTRFS_I(inode), offset, len,
2936-
0);
2934+
ret = btrfs_truncate_block(BTRFS_I(inode), offset + len - 1,
2935+
orig_start, orig_end);
29372936
if (!ret)
29382937
ret = btrfs_fallocate_update_isize(inode,
29392938
offset + len,
@@ -2964,7 +2963,8 @@ static int btrfs_zero_range(struct inode *inode,
29642963
alloc_start = round_down(offset, sectorsize);
29652964
ret = 0;
29662965
} else if (ret == RANGE_BOUNDARY_WRITTEN_EXTENT) {
2967-
ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
2966+
ret = btrfs_truncate_block(BTRFS_I(inode), offset,
2967+
orig_start, orig_end);
29682968
if (ret)
29692969
goto out;
29702970
} else {
@@ -2981,8 +2981,8 @@ static int btrfs_zero_range(struct inode *inode,
29812981
alloc_end = round_up(offset + len, sectorsize);
29822982
ret = 0;
29832983
} else if (ret == RANGE_BOUNDARY_WRITTEN_EXTENT) {
2984-
ret = btrfs_truncate_block(BTRFS_I(inode), offset + len,
2985-
0, 1);
2984+
ret = btrfs_truncate_block(BTRFS_I(inode), offset + len - 1,
2985+
orig_start, orig_end);
29862986
if (ret)
29872987
goto out;
29882988
} else {
@@ -3102,7 +3102,8 @@ static long btrfs_fallocate(struct file *file, int mode,
31023102
* need to zero out the end of the block if i_size lands in the
31033103
* middle of a block.
31043104
*/
3105-
ret = btrfs_truncate_block(BTRFS_I(inode), inode->i_size, 0, 0);
3105+
ret = btrfs_truncate_block(BTRFS_I(inode), inode->i_size,
3106+
inode->i_size, (u64)-1);
31063107
if (ret)
31073108
goto out;
31083109
}

fs/btrfs/inode.c

Lines changed: 80 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -4775,20 +4775,34 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry)
47754775
return ret;
47764776
}
47774777

4778+
static bool is_inside_block(u64 bytenr, u64 blockstart, u32 blocksize)
4779+
{
4780+
ASSERT(IS_ALIGNED(blockstart, blocksize), "blockstart=%llu blocksize=%u",
4781+
blockstart, blocksize);
4782+
4783+
if (blockstart <= bytenr && bytenr <= blockstart + blocksize - 1)
4784+
return true;
4785+
return false;
4786+
}
4787+
47784788
/*
4779-
* Read, zero a chunk and write a block.
4789+
* Handle the truncation of a fs block.
47804790
*
4781-
* @inode - inode that we're zeroing
4782-
* @from - the offset to start zeroing
4783-
* @len - the length to zero, 0 to zero the entire range respective to the
4784-
* offset
4785-
* @front - zero up to the offset instead of from the offset on
4791+
* @inode - inode that we're zeroing
4792+
* @offset - the file offset of the block to truncate
4793+
* The value must be inside [@start, @end], and the function will do
4794+
* extra checks if the block that covers @offset needs to be zeroed.
4795+
* @start - the start file offset of the range we want to zero
4796+
* @end - the end (inclusive) file offset of the range we want to zero.
47864797
*
4787-
* This will find the block for the "from" offset and cow the block and zero the
4788-
* part we want to zero. This is used with truncate and hole punching.
4798+
* If the range is not block aligned, read out the folio that covers @offset,
4799+
* and if needed zero blocks that are inside the folio and covered by [@start, @end).
4800+
* If @start or @end + 1 lands inside a block, that block will be marked dirty
4801+
* for writeback.
4802+
*
4803+
* This is utilized by hole punch, zero range, file expansion.
47894804
*/
4790-
int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
4791-
int front)
4805+
int btrfs_truncate_block(struct btrfs_inode *inode, u64 offset, u64 start, u64 end)
47924806
{
47934807
struct btrfs_fs_info *fs_info = inode->root->fs_info;
47944808
struct address_space *mapping = inode->vfs_inode.i_mapping;
@@ -4798,20 +4812,49 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
47984812
struct extent_changeset *data_reserved = NULL;
47994813
bool only_release_metadata = false;
48004814
u32 blocksize = fs_info->sectorsize;
4801-
pgoff_t index = from >> PAGE_SHIFT;
4802-
unsigned offset = from & (blocksize - 1);
4815+
pgoff_t index = (offset >> PAGE_SHIFT);
48034816
struct folio *folio;
48044817
gfp_t mask = btrfs_alloc_write_mask(mapping);
48054818
size_t write_bytes = blocksize;
48064819
int ret = 0;
4820+
const bool in_head_block = is_inside_block(offset, round_down(start, blocksize),
4821+
blocksize);
4822+
const bool in_tail_block = is_inside_block(offset, round_down(end, blocksize),
4823+
blocksize);
4824+
bool need_truncate_head = false;
4825+
bool need_truncate_tail = false;
4826+
u64 zero_start;
4827+
u64 zero_end;
48074828
u64 block_start;
48084829
u64 block_end;
48094830

4810-
if (IS_ALIGNED(offset, blocksize) &&
4811-
(!len || IS_ALIGNED(len, blocksize)))
4831+
/* @offset should be inside the range. */
4832+
ASSERT(start <= offset && offset <= end, "offset=%llu start=%llu end=%llu",
4833+
offset, start, end);
4834+
4835+
/* The range is aligned at both ends. */
4836+
if (IS_ALIGNED(start, blocksize) && IS_ALIGNED(end + 1, blocksize))
4837+
goto out;
4838+
4839+
/*
4840+
* @offset may not be inside the head nor tail block. In that case we
4841+
* don't need to do anything.
4842+
*/
4843+
if (!in_head_block && !in_tail_block)
4844+
goto out;
4845+
4846+
/*
4847+
* Skip the truncatioin if the range in the target block is already aligned.
4848+
* The seemingly complex check will also handle the same block case.
4849+
*/
4850+
if (in_head_block && !IS_ALIGNED(start, blocksize))
4851+
need_truncate_head = true;
4852+
if (in_tail_block && !IS_ALIGNED(end + 1, blocksize))
4853+
need_truncate_tail = true;
4854+
if (!need_truncate_head && !need_truncate_tail)
48124855
goto out;
48134856

4814-
block_start = round_down(from, blocksize);
4857+
block_start = round_down(offset, blocksize);
48154858
block_end = block_start + blocksize - 1;
48164859

48174860
ret = btrfs_check_data_free_space(inode, &data_reserved, block_start,
@@ -4891,17 +4934,26 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
48914934
goto out_unlock;
48924935
}
48934936

4894-
if (offset != blocksize) {
4895-
if (!len)
4896-
len = blocksize - offset;
4897-
if (front)
4898-
folio_zero_range(folio, block_start - folio_pos(folio),
4899-
offset);
4900-
else
4901-
folio_zero_range(folio,
4902-
(block_start - folio_pos(folio)) + offset,
4903-
len);
4937+
if (end == (u64)-1) {
4938+
/*
4939+
* We're truncating beyond EOF, the remaining blocks normally are
4940+
* already holes thus no need to zero again, but it's possible for
4941+
* fs block size < page size cases to have memory mapped writes
4942+
* to pollute ranges beyond EOF.
4943+
*
4944+
* In that case although such polluted blocks beyond EOF will
4945+
* not reach disk, it still affects our page caches.
4946+
*/
4947+
zero_start = max_t(u64, folio_pos(folio), start);
4948+
zero_end = min_t(u64, folio_pos(folio) + folio_size(folio) - 1,
4949+
end);
4950+
} else {
4951+
zero_start = max_t(u64, block_start, start);
4952+
zero_end = min_t(u64, block_end, end);
49044953
}
4954+
folio_zero_range(folio, zero_start - folio_pos(folio),
4955+
zero_end - zero_start + 1);
4956+
49054957
btrfs_folio_clear_checked(fs_info, folio, block_start,
49064958
block_end + 1 - block_start);
49074959
btrfs_folio_set_dirty(fs_info, folio, block_start,
@@ -5003,7 +5055,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
50035055
* rest of the block before we expand the i_size, otherwise we could
50045056
* expose stale data.
50055057
*/
5006-
ret = btrfs_truncate_block(inode, oldsize, 0, 0);
5058+
ret = btrfs_truncate_block(inode, oldsize, oldsize, -1);
50075059
if (ret)
50085060
return ret;
50095061

@@ -7637,7 +7689,8 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
76377689
btrfs_end_transaction(trans);
76387690
btrfs_btree_balance_dirty(fs_info);
76397691

7640-
ret = btrfs_truncate_block(inode, inode->vfs_inode.i_size, 0, 0);
7692+
ret = btrfs_truncate_block(inode, inode->vfs_inode.i_size,
7693+
inode->vfs_inode.i_size, (u64)-1);
76417694
if (ret)
76427695
goto out;
76437696
trans = btrfs_start_transaction(root, 1);

0 commit comments

Comments
 (0)