Skip to content

Commit 1b7558e

Browse files
KAGA-KOKOIngo Molnar
authored andcommitted
futexes: fix fault handling in futex_lock_pi
This patch addresses a very sporadic pi-futex related failure in highly threaded java apps on large SMP systems. David Holmes reported that the pi_state consistency check in lookup_pi_state triggered with his test application. This means that the kernel internal pi_state and the user space futex variable are out of sync. First we assumed that this is a user space data corruption, but deeper investigation revieled that the problem happend because the pi-futex code is not handling a fault in the futex_lock_pi path when the user space variable needs to be fixed up. The fault happens when a fork mapped the anon memory which contains the futex readonly for COW or the page got swapped out exactly between the unlock of the futex and the return of either the new futex owner or the task which was the expected owner but failed to acquire the kernel internal rtmutex. The current futex_lock_pi() code drops out with an inconsistent in case it faults and returns -EFAULT to user space. User space has no way to fixup that state. When we wrote this code we thought that we could not drop the hash bucket lock at this point to handle the fault. After analysing the code again it turned out to be wrong because there are only two tasks involved which might modify the pi_state and the user space variable: - the task which acquired the rtmutex - the pending owner of the pi_state which did not get the rtmutex Both tasks drop into the fixup_pi_state() function before returning to user space. The first task which acquired the hash bucket lock faults in the fixup of the user space variable, drops the spinlock and calls futex_handle_fault() to fault in the page. Now the second task could acquire the hash bucket lock and tries to fixup the user space variable as well. It either faults as well or it succeeds because the first task already faulted the page in. One caveat is to avoid a double fixup. After returning from the fault handling we reacquire the hash bucket lock and check whether the pi_state owner has been modified already. Reported-by: David Holmes <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Andrew Morton <[email protected]> Cc: David Holmes <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 73 insertions(+), 20 deletions(-)
1 parent 481c534 commit 1b7558e

File tree

1 file changed

+73
-20
lines changed

1 file changed

+73
-20
lines changed

kernel/futex.c

Lines changed: 73 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -1096,48 +1096,100 @@ static void unqueue_me_pi(struct futex_q *q)
10961096
* private futexes.
10971097
*/
10981098
static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
1099-
struct task_struct *newowner)
1099+
struct task_struct *newowner,
1100+
struct rw_semaphore *fshared)
11001101
{
11011102
u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
11021103
struct futex_pi_state *pi_state = q->pi_state;
1104+
struct task_struct *oldowner = pi_state->owner;
11031105
u32 uval, curval, newval;
1104-
int ret;
1106+
int ret, attempt = 0;
11051107

11061108
/* Owner died? */
1109+
if (!pi_state->owner)
1110+
newtid |= FUTEX_OWNER_DIED;
1111+
1112+
/*
1113+
* We are here either because we stole the rtmutex from the
1114+
* pending owner or we are the pending owner which failed to
1115+
* get the rtmutex. We have to replace the pending owner TID
1116+
* in the user space variable. This must be atomic as we have
1117+
* to preserve the owner died bit here.
1118+
*
1119+
* Note: We write the user space value _before_ changing the
1120+
* pi_state because we can fault here. Imagine swapped out
1121+
* pages or a fork, which was running right before we acquired
1122+
* mmap_sem, that marked all the anonymous memory readonly for
1123+
* cow.
1124+
*
1125+
* Modifying pi_state _before_ the user space value would
1126+
* leave the pi_state in an inconsistent state when we fault
1127+
* here, because we need to drop the hash bucket lock to
1128+
* handle the fault. This might be observed in the PID check
1129+
* in lookup_pi_state.
1130+
*/
1131+
retry:
1132+
if (get_futex_value_locked(&uval, uaddr))
1133+
goto handle_fault;
1134+
1135+
while (1) {
1136+
newval = (uval & FUTEX_OWNER_DIED) | newtid;
1137+
1138+
curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
1139+
1140+
if (curval == -EFAULT)
1141+
goto handle_fault;
1142+
if (curval == uval)
1143+
break;
1144+
uval = curval;
1145+
}
1146+
1147+
/*
1148+
* We fixed up user space. Now we need to fix the pi_state
1149+
* itself.
1150+
*/
11071151
if (pi_state->owner != NULL) {
11081152
spin_lock_irq(&pi_state->owner->pi_lock);
11091153
WARN_ON(list_empty(&pi_state->list));
11101154
list_del_init(&pi_state->list);
11111155
spin_unlock_irq(&pi_state->owner->pi_lock);
1112-
} else
1113-
newtid |= FUTEX_OWNER_DIED;
1156+
}
11141157

11151158
pi_state->owner = newowner;
11161159

11171160
spin_lock_irq(&newowner->pi_lock);
11181161
WARN_ON(!list_empty(&pi_state->list));
11191162
list_add(&pi_state->list, &newowner->pi_state_list);
11201163
spin_unlock_irq(&newowner->pi_lock);
1164+
return 0;
11211165

11221166
/*
1123-
* We own it, so we have to replace the pending owner
1124-
* TID. This must be atomic as we have preserve the
1125-
* owner died bit here.
1167+
* To handle the page fault we need to drop the hash bucket
1168+
* lock here. That gives the other task (either the pending
1169+
* owner itself or the task which stole the rtmutex) the
1170+
* chance to try the fixup of the pi_state. So once we are
1171+
* back from handling the fault we need to check the pi_state
1172+
* after reacquiring the hash bucket lock and before trying to
1173+
* do another fixup. When the fixup has been done already we
1174+
* simply return.
11261175
*/
1127-
ret = get_futex_value_locked(&uval, uaddr);
1176+
handle_fault:
1177+
spin_unlock(q->lock_ptr);
11281178

1129-
while (!ret) {
1130-
newval = (uval & FUTEX_OWNER_DIED) | newtid;
1179+
ret = futex_handle_fault((unsigned long)uaddr, fshared, attempt++);
11311180

1132-
curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
1181+
spin_lock(q->lock_ptr);
11331182

1134-
if (curval == -EFAULT)
1135-
ret = -EFAULT;
1136-
if (curval == uval)
1137-
break;
1138-
uval = curval;
1139-
}
1140-
return ret;
1183+
/*
1184+
* Check if someone else fixed it for us:
1185+
*/
1186+
if (pi_state->owner != oldowner)
1187+
return 0;
1188+
1189+
if (ret)
1190+
return ret;
1191+
1192+
goto retry;
11411193
}
11421194

11431195
/*
@@ -1507,7 +1559,7 @@ static int futex_lock_pi(u32 __user *uaddr, struct rw_semaphore *fshared,
15071559
* that case:
15081560
*/
15091561
if (q.pi_state->owner != curr)
1510-
ret = fixup_pi_state_owner(uaddr, &q, curr);
1562+
ret = fixup_pi_state_owner(uaddr, &q, curr, fshared);
15111563
} else {
15121564
/*
15131565
* Catch the rare case, where the lock was released
@@ -1539,7 +1591,8 @@ static int futex_lock_pi(u32 __user *uaddr, struct rw_semaphore *fshared,
15391591
int res;
15401592

15411593
owner = rt_mutex_owner(&q.pi_state->pi_mutex);
1542-
res = fixup_pi_state_owner(uaddr, &q, owner);
1594+
res = fixup_pi_state_owner(uaddr, &q, owner,
1595+
fshared);
15431596

15441597
/* propagate -EFAULT, if the fixup failed */
15451598
if (res)

0 commit comments

Comments
 (0)