Commit 5bc3295

Xunlei Pang authored and KAGA-KOKO committed
x86/mce: Handle broadcasted MCE gracefully with kexec
When we are about to kexec a crash kernel and right then and there a broadcasted MCE fires while we're still in the first kernel and while the other CPUs remain in a holding pattern, the #MC handler of the first kernel will timeout and then panic due to never completing MCE synchronization.

Handle this in a similar way as to when the CPUs are offlined when that broadcasted MCE happens.

[ Boris: rewrote commit message and comments. ]

Suggested-by: Borislav Petkov <[email protected]>
Signed-off-by: Xunlei Pang <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Tony Luck <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: [email protected]
Cc: linux-edac <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
1 parent 4495c08 commit 5bc3295

3 files changed: 20 additions and 4 deletions

arch/x86/include/asm/reboot.h

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ struct machine_ops {
 };
 
 extern struct machine_ops machine_ops;
+extern int crashing_cpu;
 
 void native_machine_crash_shutdown(struct pt_regs *regs);
 void native_machine_shutdown(void);

arch/x86/kernel/cpu/mcheck/mce.c

Lines changed: 16 additions & 2 deletions
@@ -49,6 +49,7 @@
 #include <asm/tlbflush.h>
 #include <asm/mce.h>
 #include <asm/msr.h>
+#include <asm/reboot.h>
 
 #include "mce-internal.h"
 
@@ -1127,9 +1128,22 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * on Intel.
 	 */
 	int lmce = 1;
+	int cpu = smp_processor_id();
 
-	/* If this CPU is offline, just bail out. */
-	if (cpu_is_offline(smp_processor_id())) {
+	/*
+	 * Cases where we avoid rendezvous handler timeout:
+	 * 1) If this CPU is offline.
+	 *
+	 * 2) If crashing_cpu was set, e.g. we're entering kdump and we need to
+	 *    skip those CPUs which remain looping in the 1st kernel - see
+	 *    crash_nmi_callback().
+	 *
+	 * Note: there still is a small window between kexec-ing and the new,
+	 * kdump kernel establishing a new #MC handler where a broadcasted MCE
+	 * might not get handled properly.
+	 */
+	if (cpu_is_offline(cpu) ||
+	    (crashing_cpu != -1 && crashing_cpu != cpu)) {
 		u64 mcgstatus;
 
 		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
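
The new condition in do_machine_check() can be read as a two-case predicate. The following is a minimal user-space sketch of that predicate only, not kernel code. The function name mce_skips_rendezvous(), the NO_CRASHING_CPU macro, and the `offline` parameter (standing in for cpu_is_offline(cpu)) are hypothetical names introduced for illustration. The -1 sentinel and the comparison itself are taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Sentinel meaning "no CPU is performing crash shutdown", as in the patch. */
#define NO_CRASHING_CPU (-1)

/*
 * Hypothetical model of the early bail-out check. `offline` stands in for
 * cpu_is_offline(cpu); `crashing_cpu` stands in for the kernel global of
 * the same name. Returns true when the CPU should skip the MCE rendezvous.
 */
bool mce_skips_rendezvous(int cpu, bool offline, int crashing_cpu)
{
	assert(cpu >= 0); /* CPU ids are non-negative */

	/* Case 1: an offline CPU does not take part in the rendezvous. */
	if (offline)
		return true;

	/* Case 2: a crash shutdown is in progress and this CPU is not driving it. */
	return crashing_cpu != NO_CRASHING_CPU && crashing_cpu != cpu;
}
```

With no crash in progress (crashing_cpu == -1), an online CPU goes through the normal rendezvous. Once CPU 0 has started the crash shutdown, CPU 1 skips the rendezvous while CPU 0 itself does not.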

arch/x86/kernel/reboot.c

Lines changed: 3 additions & 2 deletions
@@ -765,10 +765,11 @@ void machine_crash_shutdown(struct pt_regs *regs)
 #endif
 
 
+/* This is the CPU performing the emergency shutdown work. */
+int crashing_cpu = -1;
+
 #if defined(CONFIG_SMP)
 
-/* This keeps a track of which one is crashing cpu. */
-static int crashing_cpu;
 static nmi_shootdown_cb shootdown_callback;
 
 static atomic_t waiting_for_crash_ipi;
