Conversation

@cvinayak
Contributor

Add an implementation to gracefully handle failures caused by high Radio ISR latencies, which result from high CPU usage in the LLL execution context just before the handling of Radio IRQs. High CPU time usage in the LLL execution context can occur on slow CPUs, for example the nRF51x SoC Series with a 16 MHz ARM Cortex-M0 CPU.

@cvinayak
Contributor Author

This pull request introduces a robust handling mechanism for radio ISR latency issues during auxiliary PDU reception setup in the Bluetooth controller. It adds logic to detect and gracefully handle missed hard realtime deadlines, reducing the risk of assertion failures and improving system resilience on slower CPUs.

Improvements to radio ISR latency handling:

  • Added a countdown mechanism (latency_skip_countdown) to tolerate up to three consecutive radio setup deadline misses before triggering an assertion, allowing for graceful recovery from transient latency spikes.
  • Implemented logic to gracefully close the radio event and resume scanning if the setup deadline is missed, instead of immediately asserting, which helps maintain system stability.
  • Enhanced debugging and profiling support by providing verbose assertion messages when profiling is enabled, aiding in diagnosing persistent latency issues.

@cvinayak
Contributor Author

cvinayak commented Oct 26, 2025

Please analyse the pull request and give a summary, along with any possible regression failures or drawbacks.

Summary

  • This draft PR (98267) changes subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan_aux.c to handle failures when radio_tmr_start_us() does not return the expected start time (start_us != aux_start_us + 1). The motivation is high Radio ISR latency on slow CPUs (e.g., nRF51 Cortex-M0 @ 16 MHz) that causes missed hard realtime deadlines for auxiliary PDU reception.
  • Instead of immediately asserting on a missed deadline, the patch:
    • Introduces a static latency_skip_countdown (initialized from latency_skip_max = 3) to allow a few consecutive misses.
    • On a miss while the countdown > 0: decrements the countdown, gracefully closes the radio event, allocates a node of type NODE_RX_TYPE_EXT_AUX_RELEASE and schedules it to ULL, sets the radio ISR to resume primary scanning, calls radio_disable(), and returns.
    • If the countdown reaches zero: triggers an assertion. If CONFIG_BT_CTLR_PROFILE_ISR is enabled it includes profiling data in the assert, otherwise it does LL_ASSERT_ERR(false).
    • When radio_tmr_start_us() succeeds, it resets latency_skip_countdown to the max.
  • Net effect: tolerate a small number of consecutive radio setup misses (gracefully close and continue scanning) instead of immediate hard assert/failure.
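
For orientation, below is a condensed C sketch of the control flow described above. It is reconstructed from this summary, not copied from the diff: the arguments to radio_tmr_start_us() are illustrative, the ULL-scheduling and resume-ISR steps are abbreviated to comments, and the declarations follow the two lines quoted later in this review.

    const uint8_t latency_skip_max = 3U;
    static uint8_t latency_skip_countdown = latency_skip_max;

    start_us = radio_tmr_start_us(0U, aux_start_us);
    if (start_us != (aux_start_us + 1U)) {
            /* Hard realtime deadline missed due to Radio ISR latency */
            if (latency_skip_countdown) {
                    latency_skip_countdown--;

                    /* Gracefully close the radio event: release the aux
                     * context to ULL and resume primary scanning.
                     */
                    node_rx2 = ull_pdu_rx_alloc();
                    LL_ASSERT_ERR(node_rx2);
                    node_rx2->hdr.type = NODE_RX_TYPE_EXT_AUX_RELEASE;
                    node_rx2->rx_ftr.param = lll;
                    node_rx2->rx_ftr.lll_aux = lll->lll_aux;
                    /* ... schedule node_rx2 to ULL, set resume ISR ... */
                    radio_disable();
                    return;
            }

            /* Skip budget exhausted: persistent latency, hard assert */
            LL_ASSERT_ERR(false);
    }

    /* Setup met the deadline: replenish the skip budget */
    latency_skip_countdown = latency_skip_max;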

Possible regression failures and drawbacks

  1. Lost auxiliary PDUs / incomplete advertiser data

    • Gracefully closing the auxiliary event on misses means auxiliary PDUs will be skipped. This can lead to lost or incomplete extended advertising data (longer AD structures carried in Aux PDUs). Higher-level consumers expecting full extended advertising data might see missing fragments or incomplete data streams.
  2. Masking intermittent timing problems

    • Previously an immediate assert on start_us mismatch flagged the timing issue deterministically. Allowing up to N skips can hide intermittent timing regressions, making debugging and regression detection harder (failures may happen silently until the skip budget is exhausted).
  3. Static countdown is global to the function (shared across contexts)

    • latency_skip_countdown is a static local variable with file/function scope; it is not per-LLL instance. If multiple scan contexts (multiple lll instances) or multiple radios exist, the countdown is shared across all invocations of this function, which may be incorrect:
      • One scanner/radio experiencing a miss will decrement the shared counter and affect other scanners.
      • Better behavior would be a per-lll (per-context) counter so missed deadlines are tracked per scan instance (see the sketch after this list).
  4. Potential concurrency / SMP concerns

    • The static countdown variable is not protected by atomic operations. If this code can run on an SMP system or be re-entered from different contexts (rare for these SoCs but possible in general builds), there could be data races on latency_skip_countdown. Even if LLL is only on one CPU in typical Nordic platforms, this is an assumption that reduces generality (the sketch after this list shows one atomic, per-context variant).
  5. Race about lll->lll_aux usage

    • The patch sets node_rx2->rx_ftr.param = lll and node_rx2->rx_ftr.lll_aux = lll->lll_aux and comments about the ULL assigning lll->lll_aux later. This relies on assumptions about ordering and on the lifetime of lll and lll->lll_aux:
      • If ULL later changes or frees lll->lll_aux, the pointer stored in node_rx2 may become stale.
      • The patch's comment acknowledges the race but assumes ull_scan_aux_release will handle it correctly; this must be validated across all race scenarios (especially corner cases where ULL assigns lll->lll_aux concurrently).
      • If ull_scan_aux_release dereferences lll_aux under certain races, there is a risk of use-after-free or null/dangling pointer access.
  6. Assertion behavior and test visibility

    • When the countdown is exhausted, the patch still asserts (LL_ASSERT_MSG or LL_ASSERT_ERR). That means on persistent timing failures the behavior remains a hard stop, which preserves safety but may be disruptive in production. If profiling is enabled, it produces more verbose asserts, but otherwise the assert is uninformative.
    • For systems where a softer failure handling is preferred (e.g., degrade scanning instead of assert), this may be undesirable.
  7. Allocation on ISR path

    • The code allocates node_rx2 = ull_pdu_rx_alloc() in the LLL/ISR flow. Depending on how ull_pdu_rx_alloc is implemented, doing allocations in this execution context could be problematic (blocking, failure under low memory). The code does LL_ASSERT_ERR(node_rx2), which would assert if allocation fails. This assert-on-allocation-failure is present in the patch and is a point to watch.
  8. Test coverage gaps

    • This fix targets slow CPU targets (e.g., nRF51). If it is not tested across:
      • Multiple simultaneous scanners,
      • SMP-enabled builds,
      • Platforms with different radio driver timing semantics,
      then regressions could appear only on those configurations.
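
To make drawbacks 3 and 4 concrete, here is one possible mitigation sketch: move the skip budget into the per-scan context and access it with Zephyr atomics. The latency_skip_countdown field and the lll_scan_ctx stand-in struct are hypothetical; the upstream patch uses a function-local static instead.

    #include <zephyr/sys/atomic.h>

    #define LATENCY_SKIP_MAX 3

    /* Minimal stand-in for the real per-scan LLL context. */
    struct lll_scan_ctx {
            /* ... existing members ... */
            atomic_t latency_skip_countdown; /* hypothetical field */
    };

    /* Returns true if this miss can be tolerated. atomic_dec() returns
     * the previous value, so a previous value of zero means the budget
     * was already exhausted.
     */
    static bool latency_skip_budget_take(struct lll_scan_ctx *lll)
    {
            if (atomic_dec(&lll->latency_skip_countdown) > 0) {
                    return true;
            }

            atomic_set(&lll->latency_skip_countdown, 0);
            return false;
    }

    /* Call on successful radio setup to replenish the budget. */
    static void latency_skip_budget_reset(struct lll_scan_ctx *lll)
    {
            atomic_set(&lll->latency_skip_countdown, LATENCY_SKIP_MAX);
    }

With per-context budgets, one scanner's misses no longer consume another scanner's tolerance, and the atomic accessors remove the SMP data-race concern at little cost.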

Impact and severity

  • Functional impact: medium — can cause missed extended advertising data fragments or occasional missed Aux PDUs; may be acceptable if the intended trade-off is to avoid hard faults on slow CPUs.
  • Stability impact: low–medium — the change avoids immediate asserts on occasional misses, but shared/static counter and races could cause unexpected behavior across contexts.
  • Performance impact: minimal additional runtime overhead, but potential correctness impact as described.

Notes / suggestions (for consideration)

  • Make the skip counter per-LLL/context (e.g., store in the lll struct) rather than a single static so misses are tracked per-scan instance.
  • Consider using an atomic/lock, or ensure LLL exclusivity guarantees that protect latency_skip_countdown, if SMP is possible.
  • Add telemetry/logging (CONFIG_BT_CTLR_PROFILE_ISR-like or debug logs) for the skipped events to make intermittent timing issues visible without asserting immediately.
  • Add tests that exercise extended scanning on slow CPU targets and multi-scanner scenarios to verify no dangling pointer or ULL race conditions.
  • Re-evaluate doing memory allocation in the LLL/ISR context; if necessary, move to a deferred action or ensure allocation is non-blocking and safe (a defensive sketch follows this list).
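
On the last point, a defensive variant of the allocation path might look like the fragment below. Whether silently dropping the release notification is safe for ULL bookkeeping (ull_scan_aux_release) would need to be verified, so this is only an illustration, not the patch's behavior.

    node_rx2 = ull_pdu_rx_alloc();
    if (node_rx2) {
            node_rx2->hdr.type = NODE_RX_TYPE_EXT_AUX_RELEASE;
            node_rx2->rx_ftr.param = lll;
            node_rx2->rx_ftr.lll_aux = lll->lll_aux;
            /* ... schedule node_rx2 to ULL ... */
    } else {
            /* Allocation failed in the ISR path: count or log the drop
             * for telemetry instead of asserting.
             */
    }
    radio_disable();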

Overall assessment

  • The change is a pragmatic mitigation to prevent hard failures on slow CPUs by gracefully closing auxiliary radio events on occasional timing misses, while still asserting on repeated failures. It reduces immediate crash risk on slow/higher-latency hardware but introduces potential semantic changes (skipped Aux PDUs), a shared static counter that may be wrong for multi-context use, and possible race conditions that should be validated with targeted tests.


Copilot AI left a comment


Pull Request Overview

This PR adds graceful handling for high Radio ISR latencies on slower CPUs (e.g., nRF51x with 16 MHz ARM Cortex-M0) during extended Bluetooth scanning. When the radio setup fails to meet hard real-time deadlines due to CPU limitations, the code now tolerates a bounded number of consecutive misses via a countdown before asserting, rather than immediately failing.

Key Changes:

  • Implements a countdown-based skip mechanism (up to 3 consecutive misses) for when radio setup timing fails
  • Adds graceful radio event cleanup paths for both auxiliary and non-auxiliary scanning contexts
  • Enhances assertion diagnostics with ISR latency information when profiling is enabled


Comment on lines +363 to +364:

    const uint8_t latency_skip_max = 3U;
    static uint8_t latency_skip_countdown = latency_skip_max;

Copilot AI Oct 26, 2025


The static variable latency_skip_countdown is not thread-safe. In interrupt contexts or concurrent execution scenarios, multiple invocations could race on this shared state, leading to incorrect countdown behavior. Consider using atomic operations or re-architecting to avoid shared mutable state across ISR invocations.

@cvinayak added the 'bug' label (The issue is a bug, or the PR is fixing a bug) on Oct 27, 2025
@cvinayak added this to the v4.4.0 milestone on Oct 27, 2025
Add implementation to gracefully handle failures caused by
high Radio ISR latencies resulting from higher CPU usage in
the LLL execution context just before the handling of Radio
IRQs. Higher CPU time usage in the LLL execution context can
be due to slow CPUs, for example, the nRF51x SoC Series with
a 16 MHz ARM Cortex-M0 CPU.

Signed-off-by: Vinayak Kariappa Chettimada <[email protected]>
@cvinayak force-pushed the github_extended_scan_slow_cpu_fix branch from 0dbd7c4 to c37debb on October 27, 2025 16:02
@zephyrbot added the 'area: Bluetooth Host' (Bluetooth Host, excluding BR/EDR) and 'area: Samples' labels on Oct 27, 2025
@cvinayak
Contributor Author

This pull request introduces several improvements and bug fixes to the Bluetooth controller's low-latency features, configuration, and radio event handling. The changes focus on making low-latency options more robust, improving error handling for radio event setup, and cleaning up configuration dependencies.

Low-latency features and configuration:

  • Refactored Kconfig options for low-latency features: BT_CTLR_LOW_LAT_ULL and BT_CTLR_LOW_LAT_ULL_DONE now default to enabled when BT_CTLR_LOW_LAT is set, simplifying configuration and ensuring consistent behavior. Removed unnecessary select statements for better clarity (a Kconfig sketch follows this list).
  • Removed advanced feature and low-latency related options from overlay_bbc_microbit-bt_ll_sw_split.conf, cleaning up unused or redundant configuration lines.
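
The Kconfig refactor in the first bullet presumably takes a shape like the following; the prompt strings are illustrative and not copied from the tree:

    config BT_CTLR_LOW_LAT_ULL
            bool "Low latency ULL processing"
            default y if BT_CTLR_LOW_LAT

    config BT_CTLR_LOW_LAT_ULL_DONE
            bool "Low latency ULL done processing"
            default y if BT_CTLR_LOW_LAT

Using default-y-if instead of select keeps the options user-overridable while preserving the low-latency behavior out of the box.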

Radio event handling and error resilience:

  • Improved auxiliary PDU radio setup in lll_scan_aux_isr_aux_setup: Added logic to gracefully handle and recover from missed hard real-time deadlines due to ISR latency, with a countdown mechanism and assertion checks to catch consecutive failures. This reduces the risk of system instability due to ISR timing issues.
  • Updated common_prepare_cb in lll_scan.c to only call lll_prepare_done when not resuming, ensuring correct event lifecycle management.
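
A minimal sketch of the second bullet's change, assuming common_prepare_cb() receives a resume flag (the parameter name is an assumption):

    static int common_prepare_cb(struct lll_prepare_param *p, bool is_resume)
    {
            /* ... radio and timer setup ... */

            if (!is_resume) {
                    /* Complete the prepare only for a fresh event, not
                     * when resuming an already-prepared one.
                     */
                    lll_prepare_done(p->param);
            }

            return 0;
    }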

Low-latency ticker synchronization:

  • In ull_ticker_status_take, enabled mayfly synchronization between LLL and ULL when low-latency mode is active and priorities match, improving timing accuracy and reducing latency in ticker operations.
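
A rough sketch of what this mayfly synchronization can look like; it approximates the described behavior and is not the exact diff:

    uint32_t ull_ticker_status_take(uint32_t ret, volatile uint32_t *ret_cb)
    {
            if (ret == TICKER_STATUS_BUSY) {
                    /* In low-latency mode, with LLL and ULL mayfly callers
                     * at matching priorities, run pending mayflies so the
                     * ticker status callback executes before the status
                     * is read.
                     */
                    while (*ret_cb == TICKER_STATUS_BUSY) {
                            mayfly_run(TICKER_USER_ID_ULL_HIGH);
                    }
            }

            return *ret_cb;
    }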

@cvinayak force-pushed the github_extended_scan_slow_cpu_fix branch from ba17b0a to c3f2124 on October 27, 2025 19:22

