
@byrnedj byrnedj commented Sep 8, 2023

The main improvements are:

  1. bulk add/remove from mmContainer
  2. bulk eviction
  3. better scheduling of background threads and calculating batch sizes
  4. improved stats (including those from https://github.com/guptask/CacheLib/tree/dsa_bgknd_bulk_evictions)
  5. ability for background workers to evict/promote without the mark-moving bit - we can use acquire instead

The result is a nearly 20% improvement in multi-tier (develop branch) performance with this config:
simple_tiers_test_bg.txt




@igchor igchor left a comment


Reviewed 14 of 29 files at r1.
Reviewable status: 14 of 29 files reviewed, 8 unresolved discussions (waiting on @byrnedj)


cachelib/allocator/BackgroundMover-inl.h line 82 at r1 (raw file):

  auto assignedMemory = mutex.lock_combine([this] { return assignedMemory_; });

  while (true) {

Why? Isn't it better to let PeriodicWorker schedule the work?


cachelib/allocator/BackgroundMoverStrategy.h line 61 at r1 (raw file):

};

/*

Can you remove it if it's not needed anymore?


cachelib/allocator/Cache.h line 84 at r1 (raw file):

    bool operator<(const MemoryDescriptorType& rhs) {
        if (this->tid_ < rhs.tid_) return true;

nit: you can implement this like this:
return std::make_tuple(tid_, pid_, cid_) < std::make_tuple(rhs.tid_, rhs.pid_, rhs.cid_);

Same is true for operator==.


cachelib/allocator/Cache.h line 108 at r1 (raw file):

};

inline bool operator<(const MemoryDescriptorType& lhs, const MemoryDescriptorType& rhs) {

nit: I don't think you need to implement this operator as a free function if you already have a member operator<


cachelib/allocator/Cache.h line 119 at r1 (raw file):

  return false;
}
inline bool operator==(const MemoryDescriptorType& lhs, MemoryDescriptorType& rhs) {

nit: same here


cachelib/allocator/CacheAllocator.h line 2075 at r1 (raw file):

    uint32_t currItems = mmContainer.size();
    if (currItems < batch) {
        batch = currItems/2;

why / 2?


cachelib/allocator/CacheAllocator-inl.h line 444 at r1 (raw file):

      return memory;
    } else {
      if (memory.size() != 0) {

I would suggest using some wrapper for freeing the memory - if you do this manually, you need to be very careful about potential exceptions.


cachelib/allocator/CacheAllocator-inl.h line 1806 at r1 (raw file):

                                             bool markMoving,
                                             bool fromBgThread) {
  std::vector<Item*> newAllocs;

I'm not sure if this makes a big difference but with so many vectors we will have quite a lot of dynamic memory allocations which can impact performance. You might want to consider putting all of those variables into a tuple/struct and having just a single vector (which you should reserve()).


cachelib/allocator/memory/MemoryPool.cpp line 292 at r1 (raw file):

  // allocation class ids that need slab to initiate an allocation.
  LockHolder l(lock_);
  auto allocs2 = ac.allocateBatch(remain);

I don't quite get the logic here. If we failed to make batch allocations before, why do you expect it to succeed now?


@vinser52 vinser52 left a comment


Reviewed 1 of 29 files at r1, all commit messages.
Reviewable status: 14 of 29 files reviewed, 11 unresolved discussions (waiting on @byrnedj and @igchor)


cachelib/CMakeLists.txt line 46 at r1 (raw file):

option(BUILD_TESTS "If enabled, compile the tests." ON)
option(WITH_DTO "If enabled, build with DSA transparent offloading." OFF)

nit: I would add BUILD_ prefix to WITH_DTO: BUILD_WITH_DTO.


cachelib/allocator/BackgroundMover.h line 68 at r1 (raw file):

  BackgroundMoverStats getStats() const noexcept;
  std::map<MemoryDescriptorType,uint64_t> getClassStats() const noexcept;

nit: I would add an alias for the std::map<MemoryDescriptorType,uint64_t>, e.g. using class_moves_map_type = std::map<MemoryDescriptorType,uint64_t>;
And use this alias everywhere, to avoid bulk renaming in the future (like right now, where you need to update multiple places).


cachelib/allocator/MMTinyLFU.h line 330 at r1 (raw file):

    //          is unchanged.
    bool add(T& node) noexcept;
    bool addBatch(std::vector<T*>& nodes) noexcept;

nit: it is better practice to pass begin and end iterators to make the interface container-independent. For example:

template <class InputIt>
bool addBatch(InputIt first, InputIt last) noexcept;


@vinser52 vinser52 left a comment


Reviewable status: 14 of 29 files reviewed, 12 unresolved discussions (waiting on @byrnedj and @igchor)


cachelib/allocator/MMTinyLFU-inl.h line 218 at r1 (raw file):

template <typename T, MMTinyLFU::Hook<T> T::*HookPtr>
bool MMTinyLFU::Container<T, HookPtr>::addBatch(std::vector<T*>& nodes) noexcept {
    return false;

Why is it not implemented?


@vinser52 vinser52 left a comment


Reviewable status: 14 of 29 files reviewed, 14 unresolved discussions (waiting on @byrnedj and @igchor)


cachelib/allocator/BackgroundMover-inl.h line 39 at r1 (raw file):

                            std::memory_order_relaxed);
  maxTraversalTimeNs_.store(std::max(maxTraversalTimeNs_.load(), nsTaken),
                            std::memory_order_relaxed);

The implementation does not look thread-safe, and as far as I can tell recordTraversalTime is called by a single thread. If so, why do we need std::atomic<> to store the statistic?


cachelib/allocator/BackgroundMover-inl.h line 102 at r1 (raw file):

    auto end = util::getCurrentTimeNs();
    if (moves > 0) {
      traversalStats_.recordTraversalTime(end > begin ? end - begin : 0);

Why do we need to check end > begin? Is it an overflow check? If so, we need better overflow handling: it is better to skip the sample than to record 0.
If it is not about overflow, could you please describe when end < begin is possible?


@byrnedj byrnedj left a comment


Reviewable status: 9 of 29 files reviewed, 6 unresolved discussions (waiting on @igchor and @vinser52)


cachelib/CMakeLists.txt line 46 at r1 (raw file):

Previously, vinser52 (Sergei Vinogradov) wrote…

nit: I would add BUILD_ prefix to WITH_DTO: BUILD_WITH_DTO.

done


cachelib/allocator/BackgroundMover-inl.h line 39 at r1 (raw file):

Previously, vinser52 (Sergei Vinogradov) wrote…

The implementation does not look thread-safe, and as far as I can tell recordTraversalTime is called by a single thread. If so, why do we need std::atomic<> to store the statistic?

You're right - these don't need to be atomics - I have updated them and some other counters since no other thread will access them.


cachelib/allocator/BackgroundMover-inl.h line 82 at r1 (raw file):

Previously, igchor (Igor Chorążewicz) wrote…

Why? Isn't it better to let PeriodicWorker schedule the work?

I find that I get 20% more background evictions with this loop, which translates to some improvement in throughput. The worker runs as long as there is work to do - it avoids jumping out and being rescheduled when there are still items to evict.


cachelib/allocator/BackgroundMover-inl.h line 102 at r1 (raw file):

Previously, vinser52 (Sergei Vinogradov) wrote…

Why do we need to check end > begin? Is it an overflow check? If so, we need better overflow handling: it is better to skip the sample than to record 0.
If it is not about overflow, could you please describe when end < begin is possible?

It is just for safety - so that end - begin is non-negative and recordTraversalTime always receives a non-negative value. I copied this logic from the Reaper thread.


cachelib/allocator/MMTinyLFU-inl.h line 218 at r1 (raw file):

Previously, vinser52 (Sergei Vinogradov) wrote…

Why is it not implemented?

Done.


cachelib/allocator/memory/MemoryPool.cpp line 292 at r1 (raw file):

Previously, igchor (Igor Chorążewicz) wrote…

I don't quite get the logic here. If we failed to make batch allocations before, why do you expect it to succeed now?

Because we hold the lock - this is the same logic as the original allocate call.

@igchor igchor requested a review from vinser52 November 9, 2023 19:02

igchor commented Nov 20, 2023

@byrnedj could you please split the changes logically into commits (and remove commits with fixes after review, etc.)? I would prefer not to squash all of them on merge, so that it's easier to upstream it.

Also, would it be possible to share the implementation of findEvictionBatch with findEviction? Or should we consider that after merging this and rebasing (since there are some changes in upstream, I think)?


@vinser52 vinser52 left a comment


Reviewed 1 of 6 files at r3, all commit messages.
Reviewable status: 7 of 30 files reviewed, 7 unresolved discussions (waiting on @byrnedj and @igchor)


cachelib/allocator/BackgroundMover.h line 56 at r3 (raw file):

class BackgroundMover : public PeriodicWorker {
 public:
  using ClassBgStatsType = std::map<MemoryDescriptorType,uint64_t>;

do you really need an ordered map?


cachelib/allocator/BackgroundMover.h line 81 at r3 (raw file):

    void recordTraversalTime(uint64_t nsTaken);

    uint64_t getAvgTraversalTimeNs(uint64_t numTraversals) const;

Why do we need to pass numTraversals as a parameter? Why can't we use the numTraversals_ member?


cachelib/allocator/BackgroundMover-inl.h line 38 at r3 (raw file):

  minTraversalTimeNs_ = std::min(minTraversalTimeNs_, nsTaken);
  maxTraversalTimeNs_ = std::max(maxTraversalTimeNs_, nsTaken);
  totalTraversalTimeNs_ += nsTaken;

Should we also increment numTraversals_?


cachelib/allocator/BackgroundMover-inl.h line 44 at r3 (raw file):

uint64_t BackgroundMover<CacheT>::TraversalStats::getAvgTraversalTimeNs(
    uint64_t numTraversals) const {
  return numTraversals ? totalTraversalTimeNs_ / numTraversals : 0;

If numTraversals equals 0, can we use the numTraversals_ member and return totalTraversalTimeNs_ / numTraversals_?


cachelib/allocator/BackgroundMover-inl.h line 131 at r3 (raw file):

template <typename CacheT>
std::map<MemoryDescriptorType,uint64_t>

Use the ClassBgStatsType alias instead of std::map<MemoryDescriptorType,uint64_t>


cachelib/allocator/BackgroundMoverStrategy.h line 51 at r3 (raw file):

          batches.push_back(batch);
        } else {
          batches.push_back(0);

What does a batch size of 0 mean? Why should we add 0 to the returned vector?


@vinser52 vinser52 left a comment


Reviewed 2 of 17 files at r2.
Reviewable status: 9 of 30 files reviewed, 9 unresolved discussions (waiting on @byrnedj and @igchor)


cachelib/allocator/MMLru-inl.h line 225 at r3 (raw file):

  return lruMutex_->lock_combine([this, begin, end, currTime]() {
    uint32_t i = 0;
    for (auto itr = begin; itr != end; itr++) {

nit: the prefix form is preferable here: ++itr


cachelib/allocator/MMLru-inl.h line 227 at r3 (raw file):

    for (auto itr = begin; itr != end; itr++) {
      T* node = *itr;
      if (node->isInMMContainer()) {

How can the node already be in the MM container?


@byrnedj byrnedj left a comment


Reviewable status: 9 of 30 files reviewed, 7 unresolved discussions (waiting on @igchor and @vinser52)


cachelib/allocator/BackgroundMover.h line 56 at r3 (raw file):

Previously, vinser52 (Sergei Vinogradov) wrote…

do you really need an ordered map?

Yes, because it is nice to iterate through the (cid,pid,tid) keys in order during stats output. It also avoids implementing the hash function that unordered_map requires.


cachelib/allocator/BackgroundMover.h line 81 at r3 (raw file):

Previously, vinser52 (Sergei Vinogradov) wrote…

Why do we need to pass numTraversals as a parameter? Why can't we use the numTraversals_ member?

Done.


cachelib/allocator/BackgroundMover-inl.h line 38 at r3 (raw file):

Previously, vinser52 (Sergei Vinogradov) wrote…

Should we also increment numTraversals_?

Fixed the use of numTraversals - we will use the count from BackgroundMover and remove the duplicate numTraversals_.


cachelib/allocator/BackgroundMover-inl.h line 44 at r3 (raw file):

Previously, vinser52 (Sergei Vinogradov) wrote…

If numTraversals equals 0, can we use the numTraversals_ member and return totalTraversalTimeNs_ / numTraversals_?

Done.


cachelib/allocator/BackgroundMover-inl.h line 131 at r3 (raw file):

Previously, vinser52 (Sergei Vinogradov) wrote…

Use the ClassBgStatsType alias instead of std::map<MemoryDescriptorType,uint64_t>

Done.


cachelib/allocator/BackgroundMoverStrategy.h line 51 at r3 (raw file):

Previously, vinser52 (Sergei Vinogradov) wrote…

What does a batch size of 0 mean? Why should we add 0 to the returned vector?

Added a comment - it means we have already reached the % free-space target in this class.


cachelib/allocator/MMLru-inl.h line 227 at r3 (raw file):

Previously, vinser52 (Sergei Vinogradov) wrote…

How can the node already be in the MM container?

It shouldn't be; I changed this to throw a runtime_error.

@byrnedj byrnedj force-pushed the bg-improvements-no-pool-rebal branch from 36129e4 to c67df92 Compare February 28, 2024 21:27

byrnedj commented Feb 28, 2024

@byrnedj could you please split the changes logically into commits (and remove commits with fixes after review, etc.)? I would prefer not to squash all of them on merge, so that it's easier to upstream it.

Also, would it be possible to share the implementation of findEvictionBatch with findEviction? Or should we consider that after merging this and rebasing (since there are some changes in upstream, I think)?

I have done so. Once the rebase is finished I will rebase this branch and we can merge it.


igchor commented Mar 4, 2024

@byrnedj could you please split the changes logically into commits (and remove commits with fixes after review, etc.)? I would prefer not to squash all of them on merge, so that it's easier to upstream it.
Also, would it be possible to share the implementation of findEvictionBatch with findEviction? Or should we consider that after merging this and rebasing (since there are some changes in upstream, I think)?

I have done so. Once the rebase is finished I will rebase this branch and we can merge it.

Sounds good. Regarding sharing the implementation - we can probably do it later, there might be some refactoring needed anyway or some comments from meta.

@byrnedj byrnedj force-pushed the bg-improvements-no-pool-rebal branch from c67df92 to 57eade8 Compare March 25, 2024 19:20
@byrnedj byrnedj merged commit bd69c27 into intel:develop Mar 26, 2024