Skip to content

Conversation

AlexJones0
Copy link

This PR is the first of a series of 3 PRs to improve the OpenTitan Flash Controller emulation and make it more feature complete, with a focus on the features that are needed to support keymgr integration.

This first PR primarily focuses on a making several small fixes and improvements to the existing ot_flash code - you can see much more detail about each change in the commit messages. Aside from this, it also makes a small addition with the implementation of the CTRL_REGWEN register, which hardware uses to disable the CONTROL register while there is an ongoing flash operation.

The connection of recoverable alerts on errors should also fix two OpenTitan flash_ctrl test targets that began failing after recent test changes that expect that certain alerts are fired by the device during the test.

This would just trace an "unsupported" flash error, but still isn't
correct and should be avoided.

Signed-off-by: Alex Jones <[email protected]>
Make some more readable helper functions for checking if the flash
controller is in an operation, or whether it is in an ongoing operation
that has been split across multiple executions (e.g. read size too big
for FIFO).

Signed-off-by: Alex Jones <[email protected]>
Both the op_read and op_prog implementations were wrong in different
ways when the fifos are held in reset:
  - For read operations, the data is still read, it is just not pushed
    to the fifo. Where before the read would resume after the fifo exits
    reset, it now continues to finish the read and immediately clears.
  - For program operations, there was the possibility of an infinite
    loop as the fifos could never exit reset, but this would be caught
    by the initial assertion that the fifo is non-empty. As such, this
    is just a redundant check and so is removed.

Signed-off-by: Alex Jones <[email protected]>
Fix a bug making CTRL_REGWEN writable, presumably due to bundling it
with other REGWEN registers. The `CTRL_REGWEN` is only written by
hardware, locking the `CONTROL` register during a flash operation.

Signed-off-by: Alex Jones <[email protected]>
Based on the data page configuration behaviour and the info page cfg
docs, I had previously assumed that `en=False` for an info page config
meant that no protection rules apply (and thus no errors). This is not
the case. From the RTL, the correct behaviour is instead to forbid any
access (like having all fields as false).

Signed-off-by: Alex Jones <[email protected]>
This commit actually implements the functionality of the CTRL_REGWEN
register, which is set exclusively by hardware to indicate when an
operation is progress, to lock software from writing to the `CONTROL`
register in an attempt to potentially start a controller operation.

Signed-off-by: Alex Jones <[email protected]>
Likely when this was first created it was intended to be used to signal
complete ops with different delays, but currently we use a different
`ot_flash_op_execute` and `ot_flash_op_complete` interface that performs
all operations synchronously where possible.

Since the `INIT` pseudo-operation / command is the only thing that uses
this, refactor the function to be a callback for `init_complete` after
some initialisation delay. Also add a note about refactoring the
misleading `OtFlashOperation` enum.

Signed-off-by: Alex Jones <[email protected]>
@rivos-eblot
Copy link

rivos-eblot commented Sep 4, 2025

Note: I may have missed this point a long time ago, but the "synchronous" implementation (as in that performs
all operations synchronously where possible.
) rings some bells:

QEMU I/O thread is shared by all devices, all vCPUs and is unique. IOW, if CONTROL.num is large, this may be a real issue, as the flash driver needs needs to yield, it won't be interrupted and may cause trouble for concurrent driver racing for the I/O thread. It needs to be evaluated (how long does this take to actually complete) but it might require to use an asynchronous implementation. TBC.

@AlexJones0
Copy link
Author

AlexJones0 commented Sep 4, 2025

QEMU I/O thread is shared by all devices, all vCPUs and is unique. IOW, if CONTROL.num is large, this may be a real issue, as the flash driver needs needs to yield, it won't be interrupted and may cause trouble for concurrent driver racing for the I/O thread. It needs to be evaluated (how long does this take to actually complete) but it might require to use an asynchronous implementation. TBC.

I see, I hadn't considered that. You can only synchronously read or program up to the read/program FIFO size at once (16 words), so even though there is some logic to check permissions and update the backend I think this is still probably okay. If CONTROL.num is larger than 16, then the FIFO limits restrict how much data can be moved in a synchronous operation at once. The operation is then split into multiple synchronous operations based around the FIFO register writes/reads.

And also, with this PR the program FIFO cannot be filled before starting an operation, and so technically the logic is never programming more than one word at a time. It can still read 16 words at once, but I would expect writing to the backend to be more expensive than just reading?

I think the most expensive operation is actually therefore the bank erase, which may indeed take a while to zero out all of the backend data (or maybe it's essentially free, I'm not sure how the backend is implemented)? Perhaps if that does prove an issue we could look into making that specific operation asynchronous, in the same way that the initialize command is already handled?

When a software controller-issued operation causes an error (e.g. read
error, program error, memory protection error, program resolution error,
program type error, etc.) which is recoverable, it should also send a
`recov_err` alert.

Adding this allows the Earlgrey 1.0.0 flash_ctrl_mem_protection_test and
flash_ctrl_test to pass again, as they now check for specific alerts.

Also take the opportunity to improve the alert logic, modelling the
ALERT_TEST register as an interface for transient alerts and adding an
`alert_bm` for persistent latched alerts as is done in e.g. the OTP.
This will be important for unrecoverable faults that can be caused by
hardware operation errors later, which can only be disabled via resets.

Signed-off-by: Alex Jones <[email protected]>
With the fixes that introduce recoverable alerts on SW operation errors,
these two tests that started failing after having the alert catching
mechanism (which expects certain alerts to appear in these tests due to
bad operations) should pass again.

Signed-off-by: Alex Jones <[email protected]>
@rivos-eblot
Copy link

I think the most expensive operation is actually therefore the bank erase, which may indeed take a while to zero out all of the backend data (or maybe it's essentially free, I'm not sure how the backend is implemented)? Perhaps if that does prove an issue we could look into making that specific operation asynchronous, in the same way that the initialize command is already handled?

Yeah I'm not saying it is an actual issue, just need to run a couple of runs and see how long it takes to complete in the worse case (using a host timer for example). Can be tracked as a dedicated issue, it should not prevent from delivering this work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants