Skip to content

Commit 259c4b9

Browse files
fdmananakdave
authored andcommitted
btrfs: stop doing unnecessary log updates during a rename
During a rename, we call __btrfs_unlink_inode(), which will call btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order to remove an inode reference and a directory entry from the log. These are necessary when __btrfs_unlink_inode() is called from the unlink path, but not necessary when it's called from a rename context, because: 1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the inode reference related to the old name, because later in the rename path we call btrfs_log_new_name(), which will drop all inode references from the log and copy all inode references from the subvolume tree to the log tree. So we are doing one unnecessary btree operation which adds additional latency and lock contention in case there are other tasks accessing the log tree; 2) For the btrfs_del_dir_entries_in_log() call, we are now doing the equivalent at btrfs_log_new_name() since the previous patch in the series, that has the subject "btrfs: avoid logging all directory changes during renames". In fact, having __btrfs_unlink_inode() call this function not only adds additional latency and lock contention due to the extra btree operation, but also can make btrfs_log_new_name() unnecessarily log a range item to track the deletion of the old name, since it has no way to known that the directory entry related to the old name was previously logged and already deleted by __btrfs_unlink_inode() through its call to btrfs_del_dir_entries_in_log(). So skip those calls at __btrfs_unlink_inode() when we are doing a rename. Skipping them also allows us now to reduce the duration of time we are pinning a log transaction during renames, which is always beneficial as it's not delaying so much other tasks trying to sync the log tree, in particular we end up not holding the log transaction pinned while adding the new name (adding inode ref, directory entry, etc). This change is part of a patchset comprised of the following patches: 1/5 btrfs: add helper to delete a dir entry from a log tree 2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode 3/5 btrfs: avoid logging all directory changes during renames 4/5 btrfs: stop doing unnecessary log updates during a rename 5/5 btrfs: avoid inode logging during rename and link when possible Just like the previous patch in the series, "btrfs: avoid logging all directory changes during renames", the following script mimics part of what a package installation/upgrade with zypper does, which is basically renaming a lot of files, in some directory under /usr, to a name with a suffix of "-RPMDELETE": $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=10000 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync # Do some change to testdir and fsync it. echo -n > $MNT/testdir/file_$((NUM_FILES + 1)) xfs_io -c "fsync" $MNT/testdir echo "Renaming $NUM_FILES files..." start=$(date +%s%N) for ((i = 1; i <= $NUM_FILES; i++)); do mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE done end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Renames took $dur milliseconds" umount $MNT Testing this change on box a using a non-debug kernel (Debian's default kernel config) gave the following results: NUM_FILES=10000, before patchset: 27399 ms NUM_FILES=10000, after patches 1/5 to 3/5 applied: 9093 ms (-66.8%) NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9016 ms (-67.1%) NUM_FILES=5000, before patchset: 9241 ms NUM_FILES=5000, after patches 1/5 to 3/5 applied: 4642 ms (-49.8%) NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4553 ms (-50.7%) NUM_FILES=2000, before patchset: 2550 ms NUM_FILES=2000, after patches 1/5 to 3/5 applied: 1788 ms (-29.9%) NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1767 ms (-30.7%) NUM_FILES=1000, before patchset: 1088 ms NUM_FILES=1000, after patches 1/5 to 3/5 applied: 905 ms (-16.9%) NUM_FILES=1000, after patches 1/5 to 4/5 applied: 883 ms (-18.8%) The next patch in the series (5/5), also contains dbench results after applying to whole patchset. Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549 Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
1 parent 88d2bee commit 259c4b9

File tree

2 files changed

+59
-115
lines changed

2 files changed

+59
-115
lines changed

fs/btrfs/inode.c

Lines changed: 32 additions & 108 deletions
Original file line numberDiff line numberDiff line change
@@ -4133,9 +4133,18 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
41334133
goto err;
41344134
}
41354135

4136-
btrfs_del_inode_ref_in_log(trans, root, name, name_len, inode,
4137-
dir_ino);
4138-
btrfs_del_dir_entries_in_log(trans, root, name, name_len, dir, index);
4136+
/*
4137+
* If we are in a rename context, we don't need to update anything in the
4138+
* log. That will be done later during the rename by btrfs_log_new_name().
4139+
* Besides that, doing it here would only cause extra unncessary btree
4140+
* operations on the log tree, increasing latency for applications.
4141+
*/
4142+
if (!rename_ctx) {
4143+
btrfs_del_inode_ref_in_log(trans, root, name, name_len, inode,
4144+
dir_ino);
4145+
btrfs_del_dir_entries_in_log(trans, root, name, name_len, dir,
4146+
index);
4147+
}
41394148

41404149
/*
41414150
* If we have a pending delayed iput we could end up with the final iput
@@ -9041,8 +9050,6 @@ static int btrfs_rename_exchange(struct inode *old_dir,
90419050
u64 new_idx = 0;
90429051
int ret;
90439052
int ret2;
9044-
bool root_log_pinned = false;
9045-
bool dest_log_pinned = false;
90469053
bool need_abort = false;
90479054

90489055
/*
@@ -9145,29 +9152,6 @@ static int btrfs_rename_exchange(struct inode *old_dir,
91459152
BTRFS_I(new_inode), 1);
91469153
}
91479154

9148-
/*
9149-
* Now pin the logs of the roots. We do it to ensure that no other task
9150-
* can sync the logs while we are in progress with the rename, because
9151-
* that could result in an inconsistency in case any of the inodes that
9152-
* are part of this rename operation were logged before.
9153-
*
9154-
* We pin the logs even if at this precise moment none of the inodes was
9155-
* logged before. This is because right after we checked for that, some
9156-
* other task fsyncing some other inode not involved with this rename
9157-
* operation could log that one of our inodes exists.
9158-
*
9159-
* We don't need to pin the logs before the above calls to
9160-
* btrfs_insert_inode_ref(), since those don't ever need to change a log.
9161-
*/
9162-
if (old_ino != BTRFS_FIRST_FREE_OBJECTID) {
9163-
btrfs_pin_log_trans(root);
9164-
root_log_pinned = true;
9165-
}
9166-
if (new_ino != BTRFS_FIRST_FREE_OBJECTID) {
9167-
btrfs_pin_log_trans(dest);
9168-
dest_log_pinned = true;
9169-
}
9170-
91719155
/* src is a subvolume */
91729156
if (old_ino == BTRFS_FIRST_FREE_OBJECTID) {
91739157
ret = btrfs_unlink_subvol(trans, old_dir, old_dentry);
@@ -9223,46 +9207,31 @@ static int btrfs_rename_exchange(struct inode *old_dir,
92239207
if (new_inode->i_nlink == 1)
92249208
BTRFS_I(new_inode)->dir_index = new_idx;
92259209

9226-
if (root_log_pinned) {
9210+
/*
9211+
* Now pin the logs of the roots. We do it to ensure that no other task
9212+
* can sync the logs while we are in progress with the rename, because
9213+
* that could result in an inconsistency in case any of the inodes that
9214+
* are part of this rename operation were logged before.
9215+
*/
9216+
if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
9217+
btrfs_pin_log_trans(root);
9218+
if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
9219+
btrfs_pin_log_trans(dest);
9220+
9221+
/* Do the log updates for all inodes. */
9222+
if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
92279223
btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
92289224
old_rename_ctx.index, new_dentry->d_parent);
9229-
btrfs_end_log_trans(root);
9230-
root_log_pinned = false;
9231-
}
9232-
if (dest_log_pinned) {
9225+
if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
92339226
btrfs_log_new_name(trans, new_dentry, BTRFS_I(new_dir),
92349227
new_rename_ctx.index, old_dentry->d_parent);
9228+
9229+
/* Now unpin the logs. */
9230+
if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
9231+
btrfs_end_log_trans(root);
9232+
if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
92359233
btrfs_end_log_trans(dest);
9236-
dest_log_pinned = false;
9237-
}
92389234
out_fail:
9239-
/*
9240-
* If we have pinned a log and an error happened, we unpin tasks
9241-
* trying to sync the log and force them to fallback to a transaction
9242-
* commit if the log currently contains any of the inodes involved in
9243-
* this rename operation (to ensure we do not persist a log with an
9244-
* inconsistent state for any of these inodes or leading to any
9245-
* inconsistencies when replayed). If the transaction was aborted, the
9246-
* abortion reason is propagated to userspace when attempting to commit
9247-
* the transaction. If the log does not contain any of these inodes, we
9248-
* allow the tasks to sync it.
9249-
*/
9250-
if (ret && (root_log_pinned || dest_log_pinned)) {
9251-
if (btrfs_inode_in_log(BTRFS_I(old_dir), fs_info->generation) ||
9252-
btrfs_inode_in_log(BTRFS_I(new_dir), fs_info->generation) ||
9253-
btrfs_inode_in_log(BTRFS_I(old_inode), fs_info->generation) ||
9254-
btrfs_inode_in_log(BTRFS_I(new_inode), fs_info->generation))
9255-
btrfs_set_log_full_commit(trans);
9256-
9257-
if (root_log_pinned) {
9258-
btrfs_end_log_trans(root);
9259-
root_log_pinned = false;
9260-
}
9261-
if (dest_log_pinned) {
9262-
btrfs_end_log_trans(dest);
9263-
dest_log_pinned = false;
9264-
}
9265-
}
92669235
ret2 = btrfs_end_transaction(trans);
92679236
ret = ret ? ret : ret2;
92689237
out_notrans:
@@ -9342,7 +9311,6 @@ static int btrfs_rename(struct user_namespace *mnt_userns,
93429311
int ret;
93439312
int ret2;
93449313
u64 old_ino = btrfs_ino(BTRFS_I(old_inode));
9345-
bool log_pinned = false;
93469314

93479315
if (btrfs_ino(BTRFS_I(new_dir)) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID)
93489316
return -EPERM;
@@ -9447,25 +9415,6 @@ static int btrfs_rename(struct user_namespace *mnt_userns,
94479415
if (unlikely(old_ino == BTRFS_FIRST_FREE_OBJECTID)) {
94489416
ret = btrfs_unlink_subvol(trans, old_dir, old_dentry);
94499417
} else {
9450-
/*
9451-
* Now pin the log. We do it to ensure that no other task can
9452-
* sync the log while we are in progress with the rename, as
9453-
* that could result in an inconsistency in case any of the
9454-
* inodes that are part of this rename operation were logged
9455-
* before.
9456-
*
9457-
* We pin the log even if at this precise moment none of the
9458-
* inodes was logged before. This is because right after we
9459-
* checked for that, some other task fsyncing some other inode
9460-
* not involved with this rename operation could log that one of
9461-
* our inodes exists.
9462-
*
9463-
* We don't need to pin the logs before the above call to
9464-
* btrfs_insert_inode_ref(), since that does not need to change
9465-
* a log.
9466-
*/
9467-
btrfs_pin_log_trans(root);
9468-
log_pinned = true;
94699418
ret = __btrfs_unlink_inode(trans, BTRFS_I(old_dir),
94709419
BTRFS_I(d_inode(old_dentry)),
94719420
old_dentry->d_name.name,
@@ -9512,12 +9461,9 @@ static int btrfs_rename(struct user_namespace *mnt_userns,
95129461
if (old_inode->i_nlink == 1)
95139462
BTRFS_I(old_inode)->dir_index = index;
95149463

9515-
if (log_pinned) {
9464+
if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
95169465
btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
95179466
rename_ctx.index, new_dentry->d_parent);
9518-
btrfs_end_log_trans(root);
9519-
log_pinned = false;
9520-
}
95219467

95229468
if (flags & RENAME_WHITEOUT) {
95239469
ret = btrfs_whiteout_for_rename(trans, root, mnt_userns,
@@ -9529,28 +9475,6 @@ static int btrfs_rename(struct user_namespace *mnt_userns,
95299475
}
95309476
}
95319477
out_fail:
9532-
/*
9533-
* If we have pinned the log and an error happened, we unpin tasks
9534-
* trying to sync the log and force them to fallback to a transaction
9535-
* commit if the log currently contains any of the inodes involved in
9536-
* this rename operation (to ensure we do not persist a log with an
9537-
* inconsistent state for any of these inodes or leading to any
9538-
* inconsistencies when replayed). If the transaction was aborted, the
9539-
* abortion reason is propagated to userspace when attempting to commit
9540-
* the transaction. If the log does not contain any of these inodes, we
9541-
* allow the tasks to sync it.
9542-
*/
9543-
if (ret && log_pinned) {
9544-
if (btrfs_inode_in_log(BTRFS_I(old_dir), fs_info->generation) ||
9545-
btrfs_inode_in_log(BTRFS_I(new_dir), fs_info->generation) ||
9546-
btrfs_inode_in_log(BTRFS_I(old_inode), fs_info->generation) ||
9547-
(new_inode &&
9548-
btrfs_inode_in_log(BTRFS_I(new_inode), fs_info->generation)))
9549-
btrfs_set_log_full_commit(trans);
9550-
9551-
btrfs_end_log_trans(root);
9552-
log_pinned = false;
9553-
}
95549478
ret2 = btrfs_end_transaction(trans);
95559479
ret = ret ? ret : ret2;
95569480
out_notrans:

fs/btrfs/tree-log.c

Lines changed: 27 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -6810,7 +6810,10 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
68106810
u64 old_dir_index, struct dentry *parent)
68116811
{
68126812
struct btrfs_inode *inode = BTRFS_I(d_inode(old_dentry));
6813+
struct btrfs_root *root = inode->root;
68136814
struct btrfs_log_ctx ctx;
6815+
bool log_pinned = false;
6816+
int ret = 0;
68146817

68156818
/*
68166819
* this will force the logging code to walk the dentry chain
@@ -6837,14 +6840,22 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
68376840
if (old_dir && old_dir->logged_trans == trans->transid) {
68386841
struct btrfs_root *log = old_dir->root->log_root;
68396842
struct btrfs_path *path;
6840-
int ret;
68416843

68426844
ASSERT(old_dir_index >= BTRFS_DIR_START_INDEX);
68436845

6846+
/*
6847+
* We have two inodes to update in the log, the old directory and
6848+
* the inode that got renamed, so we must pin the log to prevent
6849+
* anyone from syncing the log until we have updated both inodes
6850+
* in the log.
6851+
*/
6852+
log_pinned = true;
6853+
btrfs_pin_log_trans(root);
6854+
68446855
path = btrfs_alloc_path();
68456856
if (!path) {
6846-
btrfs_set_log_full_commit(trans);
6847-
return;
6857+
ret = -ENOMEM;
6858+
goto out;
68486859
}
68496860

68506861
/*
@@ -6874,10 +6885,8 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
68746885
mutex_unlock(&old_dir->log_mutex);
68756886

68766887
btrfs_free_path(path);
6877-
if (ret < 0) {
6878-
btrfs_set_log_full_commit(trans);
6879-
return;
6880-
}
6888+
if (ret < 0)
6889+
goto out;
68816890
}
68826891

68836892
btrfs_init_log_ctx(&ctx, &inode->vfs_inode);
@@ -6890,5 +6899,16 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
68906899
* inconsistent state after a rename operation.
68916900
*/
68926901
btrfs_log_inode_parent(trans, inode, parent, LOG_INODE_EXISTS, &ctx);
6902+
out:
6903+
if (log_pinned) {
6904+
/*
6905+
* If an error happened mark the log for a full commit because
6906+
* it's not consistent and up to date. Do it before unpinning the
6907+
* log, to avoid any races with someone else trying to commit it.
6908+
*/
6909+
if (ret < 0)
6910+
btrfs_set_log_full_commit(trans);
6911+
btrfs_end_log_trans(root);
6912+
}
68936913
}
68946914

0 commit comments

Comments
 (0)