Skip to content

Conversation

@Fredi-raspall
Copy link
Contributor

@Fredi-raspall Fredi-raspall commented Oct 10, 2025

There is a follow-up to this PR adding functionality that is later needed (cache refresh, iterators).
However, this is ready for review.

This is part of #909
Fixes #919

@Fredi-raspall Fredi-raspall force-pushed the pr/fredi/tl_left_right_readhandles branch 3 times, most recently from 25a5fb8 to fd16542 Compare October 13, 2025 15:54
@Fredi-raspall Fredi-raspall force-pushed the pr/fredi/tl_left_right_readhandles branch 3 times, most recently from 5de9cd8 to 7fa934e Compare October 15, 2025 22:00
@Fredi-raspall Fredi-raspall added the dont-merge Do not merge this Pull Request label Oct 15, 2025
@Fredi-raspall Fredi-raspall force-pushed the pr/fredi/tl_left_right_readhandles branch 2 times, most recently from 5f54e67 to c772a0d Compare October 16, 2025 21:15
@Fredi-raspall Fredi-raspall removed the dont-merge Do not merge this Pull Request label Oct 16, 2025
@Fredi-raspall Fredi-raspall marked this pull request as ready for review October 16, 2025 21:15
@Fredi-raspall Fredi-raspall requested a review from a team as a code owner October 16, 2025 21:15
Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@qmonnet qmonnet requested a review from Copilot October 17, 2025 09:10
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a new crate left-right-tlcache that provides thread-local caching of left-right read handles. The solution addresses the challenge of safely sharing read handles across multiple threads when the set of handles may not be known upfront and can change over time.

Key changes:

  • Implements a thread-local cache system for ReadHandle<T> instances
  • Provides traits for read handle provisioning and identity management
  • Includes comprehensive test coverage with various scenarios

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

File Description
left-right-tlcache/src/lib.rs Core implementation of thread-local read handle cache with traits, cache logic, and extensive tests
left-right-tlcache/Cargo.toml Package configuration for the new crate with required dependencies
Cargo.toml Workspace configuration updates to include the new crate

// store a new entry locally with a handle, its identity and version, for the given key
let rhandle = Rc::new(fresh);
let entry = ReadHandleEntry::new(identity, Rc::clone(&rhandle), version);
let rhandle = Rc::new(factory.handle());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand the race that is being protected against here, nor do I see how this achieves any race protection. Perhaps a conversation about this is in order.

Copy link
Contributor Author

@Fredi-raspall Fredi-raspall Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I've updated the commit message to make it more clear.
It is very unlikely that this would happen, but I believe possible:

  1. Suppose you have a thread X with one rhandle and the current version is N
  2. There is provider holding a collection that may change without X's control
  3. Thread X queries the provider to validate its rhandle, by requesting the version. Provider returns N + k.
    (this is to avoid needing to request the identity which would require some more expensive lookup).
  4. Since N + k != N, then thread X needs to request the identity of the rhandle
    ... but between (3) and (4) the contents of the collection handled by the provider may have changed and the version be N+k', k'>k. As a result, thread X would get an rhandle for version N+k' but believe it was for version N+k. Nothing terrific would happen because, on a subsequent check, it would see that N+k != N+k' and re-request the identity again. But the race exists 🤷 ?

Since the fix is trivial, I added it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our case (with the provider we use) it may well be that this never happens.

}

// create local cache
make_thread_local_readhandle_cache!(TEST_CACHE, u64, TestStruct);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make this global, why not per test so that you don't have to serialize things?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are very short. I can change that if needed, but we already use the crate to serialize tests.


#[serial]
#[test]
fn test_readhandle_cache() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the actual multithreaded shuttle tests? How do you know this works with actual concurrency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning to add them for the fibtable (still have to) in our actual usage of this.
The only evidence available so far is that all the existing tests pass and the issues seen before don't show up.
Will add tests regardless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, I will add those in #918 because this PR is just for the cache.

@Fredi-raspall Fredi-raspall force-pushed the pr/fredi/tl_left_right_readhandles branch from 79683ab to eda6dab Compare October 17, 2025 21:03
@Fredi-raspall Fredi-raspall added this to the GW R1 milestone Oct 20, 2025
@mvachhar mvachhar added this pull request to the merge queue Oct 21, 2025
@mvachhar mvachhar removed this pull request from the merge queue due to a manual request Oct 21, 2025
@mvachhar mvachhar enabled auto-merge October 21, 2025 17:09
@mvachhar mvachhar disabled auto-merge October 21, 2025 17:09
@mvachhar mvachhar added the ci:+vlab Enable VLAB tests label Oct 21, 2025
@Fredi-raspall Fredi-raspall force-pushed the pr/fredi/tl_left_right_readhandles branch from c0accdb to 807bbf3 Compare October 21, 2025 20:34
Add a test provider and several tests.
Tests are serialized because they run in the same thread and access
the same cache. That could be avoided using distinct caches,though.

Signed-off-by: Fredi Raspall <[email protected]>
A FibReaderFactory object allows creating FibReaders.

Signed-off-by: Fredi Raspall <[email protected]>
This commit contains 3 changes:

1) Rename FibId -> FibKey and clarify usage. The old FibId is
   used in two ways: one, to identify a Fib; another, to look up
   fibs in the Fibtable. Rename it to FibKey since, while still
   used as an Id, the main purpose is to act as a key.
   Also, change some internal APIs to make sure that Fibs are
   identified by a single FibKey (based on the vrf id).

2) Restructure FibTable: FibReaders (which wrap ReadHandles<Fib>)
   cannot be shared among threads. Instead of storing a FibReader
   for each Fib, store a FibReaderFactory from which readers
   can be obtained on a per-thread basis. In this patch, every
   time a Fib needs to be accessed, a new FibReader is created,
   which is very inefficient due to the need for locking
   when creating ReadHandles. Subsequent commits will fix this.
   We also keep the Id of a Fib along with its factory. This is
   needed to be able to efficiently implement the traits needed
   for the readhandle cache. On this, note that:
   - the fibtable could keep fibreaders as it did, since new
     can be obtained via cloning. However, keeping instead
     factories protects us from errorneously handing off a reference
     to a FibReader to more than a thread.
   - if we would store fibreaders (and clone them on demand),
     we would not need to store the Id (since it could be obtained
     entering the fib via the readhandle), but the number of
     readers would be increased.

3) Elements in the new FibTable are stored within an Arc. This is
   to ensure that the compiler automatically determines that
   FibTable is Sync.

4) Avoid needing to derive/implement Clone() for FibTable by adding
   a dummy sync_from() Absorb implementation, by publishing on
   FibTable creation.

Signed-off-by: Fredi Raspall <[email protected]>
The thread-local cache of prior commit stores ReadHandle<Fib>.
However, the rest of the code uses transparent wrapper FibReader,
with additional methods. Implement AsRef so that we can have
FibReaders with zero cost.

Signed-off-by: Fredi Raspall <[email protected]>
Add left-right-tlcache as dependency to implement the fibtable
cache.

Implement trait Identity for Fib so that the thread-local cache of
ReadHandle<Fib>s can determine if a read handle points to the right
object or the entry needs to be invalidated.

Signed-off-by: Fredi Raspall <[email protected]>
Versioning the fibtable will allow us to determine the validity
of cached readhandles without needing to ask the provider, which
would add an extra lookup (e.g. hash or similar), on every
cache lookup. This patch adds the version for the FibTable and
the logic to monotonically increment its value anytime it is
updated. Note that version is u64 instead of AtomicU64 since
we don't need to synchronize multiple threads on its value.

Signed-off-by: Fredi Raspall <[email protected]>
Implement AsRef<ReadHandleFactory<Fib>> for FibReaderFactory so
that it can be treated as a ReadHandleFactory<Fib>.

Signed-off-by: Fredi Raspall <[email protected]>
Declare thread-local ReadHandleCache `FIBTABLE_CACHE`.
Implement trait ReadHandleProvider for FibTable so that it can
be polled by the cache, per thread.

We could implement ReadHandleProvider for FibTableReader instead.
However, implementing it for the FibTable is more general in that
it would also work if the FibTable itself was not wrapped in left-
right.

Signed-off-by: Fredi Raspall <[email protected]>
... from Rc<ReadHandle<Fib>>.

The cache stores ReadHandle<T>'s for some T. In the case of the
FibTable, T=Fib and we want to hand off FibReaders (a thin wrapper
of ReadHandle<Fib>) to the threads using the cache. Add a method
to perform such a conversion.

Signed-off-by: Fredi Raspall <[email protected]>
Add the main method to lookup FibReaders from a FibKey from the
thread-local cache and a FibTableReader. The method will check
the local cache to return FibReaders owned by the calling thread.
On failure, it will pull data from provider (fibtable) accessible
from the same FibTableReader.

Signed-off-by: Fredi Raspall <[email protected]>
Let the ipforward NF look up FibReader from thread-local cache.

Signed-off-by: Fredi Raspall <[email protected]>
Extend the thread-local cache with full refresh logic and
iterator. This requires adding a new method to trait
ReadHandleProvider.

Signed-off-by: Fredi Raspall <[email protected]>
impl ReadHandleProvider get_iter() for FibTable. This is needed to
allow the thread handling the cli to safely iterate over the fibs.

Signed-off-by: Fredi Raspall <[email protected]>
The existing iterator implementation is subobtimal in that each
call to get_iter() needs to request the provider a full iterator
for the whole set of read handle factories, unless refresh arg is
set to false, and then refresh the local cache with read handles
built from each factory. This is inefficient in case the set
of objects in the provider rarely changes (which may be the case
in most setups).

Avoid pulling an iterator from the provider if the version did
not change, as this means that no objects have been added or
removed. If anything, we remove unusable readers.

Signed-off-by: Fredi Raspall <[email protected]>
Fix the code so that a cache keeps the minimal number of rhandles
after a refresh. The previous code would keep an independent
rhandle for aliases, meaning that if there are N objects each with
one allias, 2 x N rhandles would exist after a refresh.
Solve this by letting aliases use the same rhandle as their
primary.

Signed-off-by: Fredi Raspall <[email protected]>
Misc small cleanups:
  - adjust method visibility
  - use more reasonable tracing levels
  - remove unused methods
  - clean-up tests
  - move fibgroup drop builder

Signed-off-by: Fredi Raspall <[email protected]>
This is a manual revert of 6ca1bfe.
RefCell cannot be used with more than one thread, even if threads
would only ever borrow immutably the types within an Arc since the
state kept within the RefCell to check borrows is not thread-safe
and mutates, making it unusable.

Signed-off-by: Fredi Raspall <[email protected]>
- clarify documentation.
- ensure that installed fibgroups always have at least 1 fibentry.
- ensure that fibroutes refer at least to one fibgroup.
- Add guarded methods for lpm, that keep a guarded reference to
  the fib entry to use to forward a packet. These are NOT yet used
  and actually bring no extra safety, but allow performing lpm ops
  without explicitly "entering" the left-right protected fib.

Signed-off-by: Fredi Raspall <[email protected]>
Two tests are added:
  - one to validate new methods that provide guarded fibentries.
  - one where multiple workers repeatedly do concurrent lookups
    on the same fib and for the same destination, while another
    thread adds/removes/modifies the route that best matches that
    destination.

Signed-off-by: Fredi Raspall <[email protected]>
While a fib is being deleted, it may be actively looked up to
forward packets from dataplane workers. Under heavy load,
misaligned memory accesses have been observed when a fib is
removed. This patch avoids those by "invalidating" a fib prior
to its deletion and requiring readers to check the validity of the
fib before attempting any lookup.

Signed-off-by: Fredi Raspall <[email protected]>
Also add crate concurrency as dependency for later use.
Note that no test is currently using shuttle/loom at the moment.

Signed-off-by: Fredi Raspall <[email protected]>
@Fredi-raspall Fredi-raspall force-pushed the pr/fredi/tl_left_right_readhandles branch from 7770cdf to 48d3ad6 Compare October 24, 2025 20:25
@Fredi-raspall Fredi-raspall added this pull request to the merge queue Oct 24, 2025
Merged via the queue into main with commit e720b5c Oct 24, 2025
19 checks passed
@Fredi-raspall Fredi-raspall deleted the pr/fredi/tl_left_right_readhandles branch October 24, 2025 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci:+vlab Enable VLAB tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add thread-local cache of left-right Read handles

3 participants