
Commit c6b7921

bonzini authored and Dinh Nguyen committed
KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo
[ Upstream commit 844a5fe ]

Yes, all of these are needed. :) This is admittedly a bit odd, but
kvm-unit-tests access.flat tests this if you run it with "-cpu host"
and of course ept=0.

KVM runs the guest with CR0.WP=1, so it must handle supervisor writes
specially when pte.u=1/pte.w=0/CR0.WP=0.  Such writes cause a fault
when U=1 and W=0 in the SPTE, but they must succeed because CR0.WP=0.
When KVM gets the fault, it sets U=0 and W=1 in the shadow PTE and
restarts execution.  This will still cause a user write to fault, while
supervisor writes will succeed.  User reads will fault spuriously now,
and KVM will then flip U and W again in the SPTE (U=1, W=0).  User
reads will be enabled and supervisor writes disabled, going back to the
original situation where supervisor writes fault spuriously.

When SMEP is in effect, however, U=0 will enable kernel execution of
this page.  To avoid this, KVM also sets NX=1 in the shadow PTE
together with U=0.  If the guest has not enabled NX, the result is a
continuous stream of page faults due to the NX bit being reserved.

The fix is to force EFER.NX=1 even if the CPU is taking care of the
EFER switch.  (All machines with SMEP have the CPU_LOAD_IA32_EFER
vm-entry control, so they do not use user-return notifiers for
EFER---if they did, EFER.NX would be forced to the same value as the
host).

There is another bug in the reserved bit check, which I've split to a
separate patch for easier application to stable kernels.

Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Reviewed-by: Xiao Guangrong <[email protected]>
Fixes: f6577a5
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
1 parent b6a016d commit c6b7921
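
To make the U/W/NX dance described in the message concrete, here is a
minimal userspace sketch of the flips applied to the shadow PTE for the
pte.u=1/pte.w=0/CR0.WP=0 case. The flag macros and the handle_fault()
helper are hypothetical simplifications for illustration, not KVM's
actual MMU code:

#include <stdbool.h>
#include <stdio.h>

#define PTE_W   (1ull << 1)    /* writable */
#define PTE_U   (1ull << 2)    /* user-accessible */
#define PTE_NX  (1ull << 63)   /* no-execute */

/* Guest PTE for the interesting case: pte.u=1, pte.w=0, gpte.nx=0. */
static const unsigned long long gpte = PTE_U;

/*
 * Supervisor write fault: make the spte kernel-writable (U=0, W=1) and
 * set NX=1 so SMEP semantics survive the U=0 trick.  User read/fetch
 * fault: flip back to the guest's view (U=1, W=0, NX=gpte.nx).
 */
static void handle_fault(unsigned long long *spte, bool user, bool write)
{
        if (!user && write) {
                *spte &= ~PTE_U;
                *spte |= PTE_W | PTE_NX;
        } else if (user && !write) {
                *spte |= PTE_U;
                *spte &= ~(PTE_W | PTE_NX);
                *spte |= gpte & PTE_NX;
        }
}

int main(void)
{
        unsigned long long spte = gpte;

        handle_fault(&spte, false, true);   /* supervisor write faults */
        printf("after kernel write: U=%d W=%d NX=%d\n",
               !!(spte & PTE_U), !!(spte & PTE_W), !!(spte & PTE_NX));

        handle_fault(&spte, true, false);   /* user read faults spuriously */
        printf("after user read:    U=%d W=%d NX=%d\n",
               !!(spte & PTE_U), !!(spte & PTE_W), !!(spte & PTE_NX));
        return 0;
}

Each flip makes the faulting access class succeed while pushing the
spurious fault onto the other class, which is exactly why spte.nx must
track U=0 once SMEP is enabled.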

File tree

2 files changed (+25, -14 lines)


Documentation/virtual/kvm/mmu.txt

Lines changed: 2 additions & 1 deletion
@@ -352,7 +352,8 @@ In the first case there are two additional complications:
 - if CR4.SMEP is enabled: since we've turned the page into a kernel page,
   the kernel may now execute it.  We handle this by also setting spte.nx.
   If we get a user fetch or read fault, we'll change spte.u=1 and
-  spte.nx=gpte.nx back.
+  spte.nx=gpte.nx back.  For this to work, KVM forces EFER.NX to 1 when
+  shadow paging is in use.
 - if CR4.SMAP is disabled: since the page has been changed to a kernel
   page, it can not be reused when CR4.SMAP is enabled.  We set
   CR4.SMAP && !CR0.WP into shadow page's role to avoid this case. Note,
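
The "NX bit being reserved" failure mode that this text and the commit
message describe reduces to a one-line check. A hypothetical
simplification (spte_is_reserved() is an illustration, not KVM's actual
reserved-bit logic):

#include <stdbool.h>
#include <stdio.h>

#define PTE_NX (1ull << 63)    /* no-execute, bit 63 of a PTE */

/* When EFER.NX=0, bit 63 is reserved: an spte with NX=1 faults forever. */
static bool spte_is_reserved(unsigned long long spte, bool efer_nx)
{
        return !efer_nx && (spte & PTE_NX);
}

int main(void)
{
        /* spte.nx=1 as set for the SMEP case above */
        printf("EFER.NX=0: reserved=%d\n", spte_is_reserved(PTE_NX, false));
        printf("EFER.NX=1: reserved=%d\n", spte_is_reserved(PTE_NX, true));
        return 0;
}

Forcing EFER.NX=1 whenever shadow paging is active makes NX a
legitimate bit and breaks the page-fault loop.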

arch/x86/kvm/vmx.c

Lines changed: 23 additions & 13 deletions
@@ -1718,26 +1718,31 @@ static void reload_tss(void)
 
 static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
 {
-        u64 guest_efer;
-        u64 ignore_bits;
+        u64 guest_efer = vmx->vcpu.arch.efer;
+        u64 ignore_bits = 0;
 
-        guest_efer = vmx->vcpu.arch.efer;
+        if (!enable_ept) {
+                /*
+                 * NX is needed to handle CR0.WP=1, CR4.SMEP=1.  Testing
+                 * host CPUID is more efficient than testing guest CPUID
+                 * or CR4.  Host SMEP is anyway a requirement for guest SMEP.
+                 */
+                if (boot_cpu_has(X86_FEATURE_SMEP))
+                        guest_efer |= EFER_NX;
+                else if (!(guest_efer & EFER_NX))
+                        ignore_bits |= EFER_NX;
+        }
 
         /*
-         * NX is emulated; LMA and LME handled by hardware; SCE meaningless
-         * outside long mode
+         * LMA and LME handled by hardware; SCE meaningless outside long mode.
          */
-        ignore_bits = EFER_NX | EFER_SCE;
+        ignore_bits |= EFER_SCE;
 #ifdef CONFIG_X86_64
         ignore_bits |= EFER_LMA | EFER_LME;
         /* SCE is meaningful only in long mode on Intel */
         if (guest_efer & EFER_LMA)
                 ignore_bits &= ~(u64)EFER_SCE;
 #endif
-        guest_efer &= ~ignore_bits;
-        guest_efer |= host_efer & ignore_bits;
-        vmx->guest_msrs[efer_offset].data = guest_efer;
-        vmx->guest_msrs[efer_offset].mask = ~ignore_bits;
 
         clear_atomic_switch_msr(vmx, MSR_EFER);
 
@@ -1748,16 +1753,21 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
          */
         if (cpu_has_load_ia32_efer ||
             (enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX))) {
-                guest_efer = vmx->vcpu.arch.efer;
                 if (!(guest_efer & EFER_LMA))
                         guest_efer &= ~EFER_LME;
                 if (guest_efer != host_efer)
                         add_atomic_switch_msr(vmx, MSR_EFER,
                                               guest_efer, host_efer);
                 return false;
-        }
+        } else {
+                guest_efer &= ~ignore_bits;
+                guest_efer |= host_efer & ignore_bits;
 
-        return true;
+                vmx->guest_msrs[efer_offset].data = guest_efer;
+                vmx->guest_msrs[efer_offset].mask = ~ignore_bits;
+
+                return true;
+        }
 }
 
 static unsigned long segment_base(u16 selector)
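
The masking in the new else branch is plain bit arithmetic: ignored
bits take the host's value, so the user-return notifier only ever has
to switch bits that truly differ. A standalone sketch with assumed EFER
values (illustrative constants mirroring the architectural bit
positions; not kernel code):

#include <stdio.h>

#define EFER_SCE (1ull << 0)     /* syscall enable */
#define EFER_NX  (1ull << 11)    /* no-execute enable */

int main(void)
{
        unsigned long long host_efer   = EFER_SCE | EFER_NX;  /* assumed */
        unsigned long long guest_efer  = 0;                   /* guest left NX off */
        unsigned long long ignore_bits = EFER_SCE | EFER_NX;  /* e.g. no SMEP */

        /* Bits we don't care about take the host's value. */
        guest_efer &= ~ignore_bits;
        guest_efer |= host_efer & ignore_bits;

        printf("effective guest EFER = %#llx, host EFER = %#llx\n",
               guest_efer, host_efer);   /* identical: no MSR write needed */
        return 0;
}

The companion .mask = ~ignore_bits marks which bits of the stored value
the shared-MSR machinery should actually transfer when entering the
guest.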
