intermittent SMP crashes on x86_64 #21317

@andrewboie

Description

I'm seeing some sporadic crashes on x86_64.

These crashes seem to have the following characteristics:

  1. The instruction pointer (RIP) is NULL.
  2. It seems to happen while the main thread is creating new child threads to run test cases, but I haven't been able to pinpoint where or get a stack trace.

Here's one example, though I have seen this occur in many tests:

*** Booting Zephyr OS build zephyr-v2.1.0-238-g5abb770487f7  ***
Running test suite test_sprintf
===================================================================
starting test - test_sprintf_double
SKIP - test_sprintf_double
===================================================================
starting test - test_sprintf_integer
E: ***** CPU Page Fault (error code 0x0000000000000010)
E: Supervisor thread executed address 0x0000000000000000
E: PML4E: 0x000000000011a827 Writable, User, Execute Enabled
E: PDPTE: 0x0000000000119827 Writable, User, Execute Enabled
E:   PDE: 0x0000000000118827 Writable, User, Execute Enabled
E:   PTE: Non-present
E: RAX: 0x0000000000000008 RBX: 0x0000000000000000 RCX: 0x00000000000f4240 RDX: 0x0000000000000000
E: RSI: 0x0000000000127000 RDI: 0x0000000000002710 RBP: 0x0000000000000000 RSP: 0x0000000000126fb0
E:  R8: 0x000000000011cd0c  R9: 0x0000000000000000 R10: 0x0000000000000000 R11: 0x0000000000000000
E: R12: 0x0000000001000000 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x0000000000126fb0 RFLAGS: 0x0000000000000202 CS: 0x0018 CR3: 0x000000000010a000
E: call trace:
E: RIP: 0x0000000000000000
E: NULL base ptr
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
E: Current thread: 0x000000000011c8a0 (main)
E: Halting system
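
For anyone reading the dump above, the error code 0x10 can be decoded with the architectural x86 page fault error code bits. The following is a minimal sketch in C, not Zephyr's actual fault handler; the names `pf_info`, `decode_pf_error` and `print_pf_info` are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Architectural x86 page fault error code bits (not Zephyr-specific). */
#define PF_P    (1u << 0) /* set: protection violation; clear: page not present */
#define PF_WR   (1u << 1) /* set: write access; clear: read access */
#define PF_US   (1u << 2) /* set: fault taken in user mode; clear: supervisor */
#define PF_RSVD (1u << 3) /* set: reserved bit was set in a paging entry */
#define PF_ID   (1u << 4) /* set: fault was caused by an instruction fetch */

/* Hypothetical helper type for this sketch. */
struct pf_info {
	bool present;
	bool write;
	bool user;
	bool reserved;
	bool ifetch;
};

/* Hypothetical helper: split an error code into its individual flags. */
static struct pf_info decode_pf_error(uint64_t code)
{
	struct pf_info info = {
		.present  = (code & PF_P) != 0,
		.write    = (code & PF_WR) != 0,
		.user     = (code & PF_US) != 0,
		.reserved = (code & PF_RSVD) != 0,
		.ifetch   = (code & PF_ID) != 0,
	};

	return info;
}

/* Hypothetical helper: print a one-line summary of the fault. */
static void print_pf_info(uint64_t code)
{
	struct pf_info info = decode_pf_error(code);

	printf("%s %s, page %s%s\n",
	       info.user ? "user" : "supervisor",
	       info.ifetch ? "instruction fetch" : (info.write ? "write" : "read"),
	       info.present ? "present (protection violation)" : "not present",
	       info.reserved ? ", reserved bit set" : "");
}
```

Passing 0x10 to `decode_pf_error` sets only the instruction-fetch flag: a supervisor-mode fetch from a non-present page. That agrees with the "Supervisor thread executed address 0x0000000000000000" and "PTE: Non-present" lines in the dump, so the CPU really did jump to address 0 rather than fault on a data access.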

I started noticing this after I enabled boot page tables. It's unclear whether my work introduced this or the issue was already present, although I'm starting to suspect the latter, since the code I brought in works great for 32-bit.

Because sanitycheck automatically retries failed test cases (see #14173), this has gone undetected in CI.

Metadata

Labels

area: SMP (Symmetric multiprocessing)
area: X86_64 (x86-64 Architecture (64-bit))
bug (The issue is a bug, or the PR is fixing a bug)
priority: medium (Medium impact/importance bug)
