Skip to content

Conversation

@carlescufi
Copy link
Member

@carlescufi carlescufi commented Jun 4, 2025

The difference between __irq_disable() and irq_lock() is that the former essentially translates to cpsid i, whereas irq_lock() translates to setting BASEPRI (on cores with BASEPRI). This means that using irq_lock() does not disable zero-latency interrupts (ZLIs), which reduces the potential execution latency of ZLIs.

In both isr_wrapper() and _arch_isr_direct_pm() (which is just really an implementation of ISR_DIRECT_PM()), we were using __irq_disable() to disable all interrupts, including ZLIs. But the code executed with interrupts disabled handles waking up from idle, and so must only be protected against regular interrupts being executed, not ZLIs, which should have no effect on the correct execution of the code.

@github-actions github-actions bot added the area: ARM ARM (32-bit) Architecture label Jun 4, 2025
@carlescufi carlescufi changed the title arch: arm: cortex-m: Reduce ZLI latency by not disabling them in wrapper RFC: arch: arm: cortex-m: Reduce ZLI latency by not disabling them in wrapper Jun 4, 2025
@carlescufi carlescufi requested review from bjarki-andreasen and removed request for galak June 4, 2025 17:00
unsigned int key;

/* irq_lock() does what we want for this CPU */
/* Disable all interrupts except ZLIs. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_arch_isr_direct_pm() could be called in ZLI and non-ZLI direct ISRs. A race will lead to pm_system_resume() not being called when it should be or corrupt the data accessed inside the function call, right?

On ARMv6, irq_lock() is nothing but disable all interrupts. On ARMv7, cpsid i needs to be used to disable all interrupts

Zephyr Controller calls ISR_DIRECT_PM() here:

Does this PR mean, no ZLI ISR call ISR_DIRECT_PM() ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ZLIs must not call any zephyr APIs since it breaks synchronisation :) pm_system_resume() is a zephyr API

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be pedantic: Zephyr APIs are not guaranteed to work in a ZLI, and specifically as you point out anything that relies on synchronization is guaranteed not to work (or not to work reliably). But there's lots of good stuff in the Zephyr API that you can use as long as you know what you're doing. Like, Best Practice for communication with a ZLI almost certainly involves the Zephyr atomics API. We've just been too lazy/conservative to try to document where the boundary is.

Copy link
Contributor

@bjarki-andreasen bjarki-andreasen Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, IPC style communication with software IRQs like nordics SWI and shared memory area I think is the only truly safe way of syncing between ZLI and the zephyr kernel :)

Copy link
Member Author

@carlescufi carlescufi Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have never been able to call _arch_isr_direct_pm or pm_system_resume() safely from a ZLI. Starting with the fact that they read and modify kernel structures like _kernel.idle, this was never possible (at least safely). ZLIs can preempt the kernel any time so modifying kernel structures is a no-no. Remember: even though all interrupts were disabled prior to modifying kernel structs and calling pm_system_resume(), it was already too late! The ZLI could have preempted the kernel during an irq_lock() or sched_lock() execution, so you already had a race, regardless of global disabling of interrupts. Then if we go into pm_system_resume() it's even worse: tons of variables are being read and set there. In summary: it does not matter if you disable global interrupts or not, you can never call code from a ZLI that relies on irq_lock() or sched_lock() to prevent races.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nordic-krch agreed! Could you follow-up with #17301 so I can add the assert after?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be an assert to ensure that ZLI interrupt does not call things that should not be called in that context. For that we would need API for getting current priority. It was proposed long time ago but did not get in #17301

@nordic-krch Could you expand on that? asserts themselves are a zephyr API, so ZLIs are not even allowed to call those, how would you enforce this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if that is an issue we can add ZLI_ASSERT that will do while(1) or reset. The thing is that it will be fail immediately.

Copy link
Contributor

@bjarki-andreasen bjarki-andreasen Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does present an issue though, ZLIs are meant to be "above" the zephyr kernel, a place where a user can do "whatever they want", so such a macro would have to be application specific I would think right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if it's just a loop you could simply disable interrupts and spin. This should work for all applications in principle? Then perhaps it should be overridable.


/* irq_lock() does what we want for this CPU */
/* Disable all interrupts except ZLIs. */
key = irq_lock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't really matter in context, but while you're here it's probably best to swap this for arch_irq_lock(), which is the same thing on any uniprocessor architecture. But on SMP, "irq_lock()" is actually a complicated global spinlock and "arch_irq_lock()" is the call that locally masks interrupts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, did not know that. Will make the change.

andyross
andyross previously approved these changes Jun 4, 2025
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely sounds correct to me, modulo the inevitable app somewhere that turns out to actually care about this behavior (because of bad assumptions about interrupt priority, maybe). Even then we should probably have the fight, as this wasn't ever really justified.

The difference between __irq_disable() and irq_lock() is that the former
essentially translates to `cpsid i`, whereas `irq_lock()` translates to
setting BASEPRI (on cores with BASEPRI). This means that using
irq_lock() does not disable zero-latency interrupts (ZLIs), which
reduces the potential execution latency of ZLIs.

In both isr_wrapper and _arch_isr_direct_pm() (which is just
really an implementation of ISR_DIRECT_PM()), we were using
__irq_disable() to disable all interrups, including ZLIs. But the code
executed with interrupts disabled handles waking up from idle, and so
must only be protected against regular interrupts being executed, not
ZLIs, which should have no effect on the correct execution of the code.

Signed-off-by: Carles Cufi <[email protected]>
@carlescufi
Copy link
Member Author

@andyross switched to arch_irq_lock() in this last push.

@github-actions github-actions bot requested a review from galak June 4, 2025 20:08
@sonarqubecloud
Copy link

sonarqubecloud bot commented Jun 4, 2025

@cvinayak cvinayak dismissed their stale review June 5, 2025 09:30

Hope the maintainer reviews use of ISR_DIRECT_PM in the repo, and reviews the change introduce by moving away from use of __irq_disable to using arch_irq_lock . Please also update documentation of ISR_DIRECT_PM and its documentation references like for instance here:

* The major differences with regular Zephyr interrupts are the following:
* - Similar to direct interrupts, the call into the OS to exit power
* management idle state is optional. Normal interrupts always do this
* before the ISR is run, but with dynamic direct ones when and if it runs
* is controlled by the placement of
* a ISR_DIRECT_PM() macro, or omitted entirely.
* - Similar to direct interrupts, scheduling decisions are optional. Unlike
* direct interrupts, the decisions must be made at build time.
* They are controlled by @param resch to this macro.

@carlescufi
Copy link
Member Author

Hope the maintainer reviews use of ISR_DIRECT_PM in the repo, and reviews the change introduce by moving away from use of __irq_disable to using arch_irq_lock . Please also update documentation of ISR_DIRECT_PM and its documentation references like for instance here:

I think the maintainers have already reviewed it. Regarding the doxygen update you requested, I did that in a separate PR:
#91053

@cvinayak
Copy link
Contributor

cvinayak commented Jun 5, 2025

Related Zephyr Controller PR #91119.

Copy link
Contributor

@wearyzen wearyzen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have much experience with ZLI but the PR and discussions make sense so LGTM.

@carlescufi carlescufi changed the title RFC: arch: arm: cortex-m: Reduce ZLI latency by not disabling them in wrapper arch: arm: cortex-m: Reduce ZLI latency by not disabling them in wrapper Jun 5, 2025
@dleach02 dleach02 merged commit dc006e6 into zephyrproject-rtos:main Jun 6, 2025
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: ARM ARM (32-bit) Architecture

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants