Skip to content

Commit ba806c9

Browse files
shroffniChristoph Hellwig
authored andcommitted
nvme: correctly account for namespace head reference counter
The blktests nvme/058 manifests an issue where the NVMe subsystem kobject entry remains stale in sysfs, causing a failure during subsequent NVMe module reloads[1]. Specifically, when attempting to register a new NVMe subsystem, the driver encounters a kobejct name collision because a stale kobject still exists. Though, please note that nvme/058 doesn't report any failure and test case passes and it's only during subsequent NVMe module reloads, the stale nvme sub- system kobject entry in sysfs causes the observed symptom[1]. This issue stems from an imbalance in the get/put usage of the namespace head (nshead) reference counter. The nshead holds a reference to the associated NVMe subsystem. If the nshead reference is not properly released, it prevents the cleanup of the subsystem's kobject, leaving nvme subsystem stale entry behind in sysfs. During the failure case, the last namespace path referencing a nshead is removed, but the nshead reference was not released. This occurs because the release logic currently only puts the nshead reference when its state is LIVE. However, in configurations where ANA (Asymmetric Namespace Access) is enabled, a namespace may be associated with an ANA state that is neither optimized nor non-optimized. In this case, the nshead may never transition to LIVE, and the corresponding nshead reference is then never dropped. In fact nvme/058 associates some of nvme namespaces to an inaccessible ANA state and with that nshead is created but it's state is not transitioned to LIVE. So the current logic would then causes nshead reference to be leaked for non-LIVE states. Another scenario, during namespace allocation, the driver first allocates a nshead and then issues an Identify Namespace command. If this command fails — which can happen in tests like nvme/058 that rapidly enables and disables namespaces — we must release the reference to the newly allocated nshead. However this reference release is currently missing in the failure, causing a nshead reference leak. To fix this, we now unconditionally release the nshead reference when the last nvme path referencing to the nshead is removed, regardless of the head’s state. Also during identify namespace failure case we now properly release the nshead reference. So this ensures proper cleanup of the nshead, and consequently, the NVMe subsystem and its associated kobject. This change prevents stale kobject entries from lingering in sysfs and eliminates the module reload failures observed just after running nvme/058. [1] https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/ Reported-by: [email protected] Closes: https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/ Fixes: 6218863 ("nvme-multipath: introduce delayed removal of the multipath head node") Tested-by: [email protected] Signed-off-by: Nilay Shroff <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
1 parent 2e96d2d commit ba806c9

File tree

2 files changed

+19
-2
lines changed

2 files changed

+19
-2
lines changed

drivers/nvme/host/core.c

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4086,6 +4086,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
40864086
struct nvme_ns *ns;
40874087
struct gendisk *disk;
40884088
int node = ctrl->numa_node;
4089+
bool last_path = false;
40894090

40904091
ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
40914092
if (!ns)
@@ -4178,9 +4179,22 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
41784179
out_unlink_ns:
41794180
mutex_lock(&ctrl->subsys->lock);
41804181
list_del_rcu(&ns->siblings);
4181-
if (list_empty(&ns->head->list))
4182+
if (list_empty(&ns->head->list)) {
41824183
list_del_init(&ns->head->entry);
4184+
/*
4185+
* If multipath is not configured, we still create a namespace
4186+
* head (nshead), but head->disk is not initialized in that
4187+
* case. As a result, only a single reference to nshead is held
4188+
* (via kref_init()) when it is created. Therefore, ensure that
4189+
* we do not release the reference to nshead twice if head->disk
4190+
* is not present.
4191+
*/
4192+
if (ns->head->disk)
4193+
last_path = true;
4194+
}
41834195
mutex_unlock(&ctrl->subsys->lock);
4196+
if (last_path)
4197+
nvme_put_ns_head(ns->head);
41844198
nvme_put_ns_head(ns->head);
41854199
out_cleanup_disk:
41864200
put_disk(disk);

drivers/nvme/host/multipath.c

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -690,8 +690,8 @@ static void nvme_remove_head(struct nvme_ns_head *head)
690690
nvme_cdev_del(&head->cdev, &head->cdev_device);
691691
synchronize_srcu(&head->srcu);
692692
del_gendisk(head->disk);
693-
nvme_put_ns_head(head);
694693
}
694+
nvme_put_ns_head(head);
695695
}
696696

697697
static void nvme_remove_head_work(struct work_struct *work)
@@ -1291,6 +1291,9 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
12911291
{
12921292
bool remove = false;
12931293

1294+
if (!head->disk)
1295+
return;
1296+
12941297
mutex_lock(&head->subsys->lock);
12951298
/*
12961299
* We are called when all paths have been removed, and at that point

0 commit comments

Comments
 (0)