
Conversation

@ojhunt ojhunt commented Aug 2, 2025

The CAS OnDiskGraphDB backend currently uses a single directory for persistent storage. Over time this can lead to a very large number of files accumulating in one directory. While the point at which this overhead becomes significant varies from one file system to another, all of them eventually reach a point where the number of entries in a single directory starts to seriously degrade the performance of a wide array of operations performed on the entries within that directory or on the directory itself.

In this PR we introduce an intermediate layer of subdirectories in order to spread the individual persistent files over a large number of different directories. This is the same general approach that has been taken by a wide array of other applications for exactly the same reasons.

The PR currently uses 2 radix-36 characters, giving an approximately 10-bit fan-out. There does not seem to be a consistent bias in favour of wider or deeper fan-outs, but for now a relatively wide and shallow approach seems reasonable.
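For illustration, the naming scheme could be sketched as follows. This is a hypothetical stand-alone sketch (the function name `fanOutDirName` is made up, not the PR's actual code): two radix-36 digits give 36 * 36 = 1296 buckets, or about 10.3 bits of fan-out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch: derive a 2-character radix-36 subdirectory name
// from a 64-bit hash. 36^2 = 1296 buckets (~10.3 bits of fan-out).
static const char Radix36Chars[] = "0123456789abcdefghijklmnopqrstuvwxyz";

std::string fanOutDirName(uint64_t Hash) {
  std::string Name;
  Name += Radix36Chars[Hash % 36];        // least-significant base-36 digit
  Name += Radix36Chars[(Hash / 36) % 36]; // next base-36 digit
  return Name;
}
```

A file whose hash lands in bucket 0 would be placed under "00/", bucket 37 under "11/", and so on.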

We currently choose not to use a fan-out for temporary files, mostly for practical reasons. When fanning out across subdirs we need to ensure that the subdirectories exist, and LLVM's current APIs for constructing temporary files do not provide a mechanism for automatically building the required directories. While we could add such functionality, that may actually hinder performance rather than help: temporary files created by LLVM are just that, and so are cleaned up at the end of execution. As a result the risk of large numbers of files accumulating is relatively low, while the additional file system work needed to check for and then create new directory entries is relatively high.

@ojhunt ojhunt requested a review from benlangmuir August 2, 2025 00:24
@ojhunt ojhunt self-assigned this Aug 2, 2025

ojhunt commented Aug 2, 2025

I'm not super happy with the optional TempDir out parameter I use but everything else seems worse.

It also seems like the CAS version should change but I don't know what the rules around that are.


// This is around 10 bits of entropy given we're using radix-36
static const unsigned IntermediateDirNameLength = 2;
static const char Radix36Chars[] = "0123456789abcdefghijklmnopqrstuvwxyz";
ojhunt (Author):

@benlangmuir This is a relatively common encoding for people to use in this kind of context, but I couldn't find an existing impl anywhere in the tree, and I was not sure it was common enough that it warranted adding one.


This is how I've usually seen this done in llvm:

toString(llvm::APInt(64, Hash), 36, /*Signed=*/false);


Oops sent before finishing this comment:

There is also std::to_chars, which doesn't need to allocate a string.


StringExtras has toHex, which takes an ArrayRef of bytes. I suggest just taking a few bytes from the data hash, hex-encoding them with toHex, and using that as the directory name.
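As a stand-alone illustration of this suggestion (plain C++ rather than LLVM's actual toHex from StringExtras; the helper name `hexDirName` is made up):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for hex-encoding a prefix of a content hash to use as a
// subdirectory name. One byte gives 256 buckets, two bytes give 65536.
std::string hexDirName(const std::vector<uint8_t> &Hash, size_t NumBytes) {
  static const char HexChars[] = "0123456789abcdef";
  std::string Name;
  for (size_t I = 0; I < NumBytes && I < Hash.size(); ++I) {
    Name += HexChars[Hash[I] >> 4];  // high nibble
    Name += HexChars[Hash[I] & 0xF]; // low nibble
  }
  return Name;
}
```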

ojhunt (Author):

I was deliberately using base 36: base 16 with a 2-char fan-out gives a fanning factor of 256 instead of 1296, and if we bump up to 3 characters, hex reaches 4k subdirs, which starts running into birthday paradox issues.
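For reference, the fanning factors being compared work out as follows (a quick stand-alone sketch; the numbers, not the code, are what the comment above refers to):

```cpp
#include <cstdint>

// Number of subdirectories for a given radix and name length: radix^chars.
// With N files spread uniformly, the average directory then holds
// N / (radix^chars) entries.
uint64_t fanOut(uint64_t Radix, unsigned Chars) {
  uint64_t Buckets = 1;
  for (unsigned I = 0; I < Chars; ++I)
    Buckets *= Radix;
  return Buckets;
}
```

So fanOut(16, 2) is 256, fanOut(36, 2) is 1296, and fanOut(16, 3) is 4096, matching the figures above.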

ojhunt (Author):

You can always zero out a few bits from a char to make your hash, unless having only hex digits in the file name also means a bad distribution for the file system's own hashing, or a bad distribution of chars in the filename can cause that (for example, if you take 9 bits, the first char of the filename is either 1 or 0).

lol GitHub didn't show me this comment :D

ojhunt (Author):

I've adopted toHex, and split into multiple directories, which should very significantly reduce pressure on the filesystem. I re-did the math on ye olde birthday paradox and even 10 bits leads to a higher collision rate than I'd like. The current iteration allows multiple levels of intermediate dirs, but as Ben pointed out that could make other parts slower; maybe the rename-then-construct-the-path approach would mitigate that, though until you get a large enough number of files that might not matter?

I'm a bit on the fence here - git and similar do seem to prefer multiple levels, but they produce huge numbers of objects, and the real issue we hit with CAS is the lack of pruning when it is used directly. Resolving the pruning would presumably limit the number of files as well, reducing the need for multiple layers of dirs.

Thoughts?


I would prefer to start with a single level of sub directories unless there is compelling evidence we need more. Each directory increases the cost of accessing these files and creating the requisite directories.

git and similar do seem to prefer multiple levels

I thought git/objects uses a single byte prefix (two hex digits)? It's not really comparable though, because it only stores new objects this way and later moves them into packs.

ojhunt (Author):

Should I leave the support for multiple levels in there? My inclination is not to as it's easy enough to add back later if we do decide to extend it.

One other thing I realized is that this will likely impact anything that tries to prune the caches, but I have no idea how anything does that.


@benlangmuir benlangmuir Aug 28, 2025


Should I leave the support for multiple levels in there?

No, better to simplify.

One other thing I realized is that this will likely impact anything that tries to prune the caches, but I have no idea how anything does that.

There should be no impact (EDIT: except performance), because we recursively delete a whole v1.x directory.

PersistentPath.assign(RootPath.begin(), RootPath.end());
if (TempPath)
TempPath->assign(RootPath.begin(), RootPath.end());
SmallVector<char, 256> FileNameBuffer;
ojhunt (Author):

This buffer is a bit irksome, but the lack of a Twine iterator requires the copy in order to hash the filename.

We could just hash the Offset, which would save the copy, but given we have the prefix and suffix available it seemed reasonable to use them.

I may look into the stable hash interfaces over the weekend to see if this can be performed in a better way.


We should just hash the offset. The prefix is the same for every file in the CAS and each offset only corresponds to a single file, so suffix is redundant with the offset for hashing purposes.


You can just reuse the hash of the data. IndexProxy has Hash for the content, and you can just grab some bytes from there. That is a BLAKE3 hash, so it should be strong enough to fan out the storage.

ojhunt (Author):

That will do; I just assumed that at this point there was no hash available anymore.

ojhunt (Author):

@cachemeifyoucan hmm, the IndexProxy hash is of the content - I assume it does not include the offset? I may blend them if so.


There is no reason to blend them. There is a 1:1 mapping, and the hash in the index proxy is a cryptographic hash, so it will already have a good distribution.

ojhunt (Author):

@benlangmuir @cachemeifyoucan how does lookup work when opening the file? I assume we already have the content hash at that point? I'm not sure what "offset" represents here, so I'm not entirely sure of the semantics of everything.

i.e. is it possible for the same content to be stored with different offsets (ignoring collisions: same content hash + different offsets)?


The offset is not a stable value associated with the content stored, but with the CAS itself. The offset is associated only with this specific instance of the CAS, and the value is assigned when the object is stored inside it.

The reason the file name uses the offset is that it is much shorter than the full hash.


@cachemeifyoucan cachemeifyoucan Aug 28, 2025


Imagine a SetVector: the Hash is the key stored in the set, and the Offset is the index into the SetVector, assigned based on insertion order.
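The analogy could be sketched in plain C++ (a toy model for illustration only, not the actual OnDiskGraphDB data structures):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model of the SetVector analogy: content hashes get offsets assigned
// by insertion order, so an offset identifies an object only within this
// particular store instance.
struct ToyStore {
  std::unordered_map<std::string, size_t> Index; // hash -> offset
  std::vector<std::string> Objects;              // offset -> hash

  size_t insert(const std::string &Hash) {
    auto It = Index.find(Hash);
    if (It != Index.end())
      return It->second;            // existing object keeps its offset
    size_t Offset = Objects.size(); // next insertion index
    Objects.push_back(Hash);
    Index.emplace(Hash, Offset);
    return Offset;
  }
};
```

Inserting the same hash twice returns the same offset; a second distinct hash gets the next index.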


@benlangmuir benlangmuir left a comment


This requires a CAS format version bump, which you can accomplish by updating FilePrefix at the top of OnDiskGraphDB.cpp. It will probably require some test changes to match.



StringRef FileName =
    (FilePrefix + Twine(I.Offset.get()) + Suffix).toStringRef(FileNameBuffer);

unsigned FileNameHash = stable_hash_name(FileName);


Unfortunately, stable_hash_... is not stable enough for our purposes. It is documented to be stable across environments and program executions, but critically it is allowed to change in future versions, which would require a CAS format version bump.

ojhunt (Author):

@benlangmuir @cachemeifyoucan suggested just using IndexProxy.Hash -- would that be reasonable? We don't need the hash to be based on the file name, just that files get spread out reasonably uniformly. The most recent iteration does just that. Otherwise we could move to xxHash, I think - or SipHash, which has to be stable since it feeds into the ptrauth ABI.

@@ -596,7 +597,17 @@ Error OnDiskGraphDB::TempFile::keep(const Twine &Name) {
   assert(!Done);
   Done = true;
   // Always try to close and rename.
-  std::error_code RenameEC = sys::fs::rename(TmpName, Name);
+  std::error_code RenameEC = sys::fs::create_directories(
+      sys::path::parent_path(Name.str())


I wonder if we should attempt the rename first and only create the directory on failure. That will increase the cost the first time we put a file in that subdirectory (+1 rename call), but decrease it for any subsequent files (-1 mkdir). A couple of small-ish CAS databases I have sitting around have 3000+ standalone files, which would be more than 2 per sub-directory on average. It's not a big deal either way since we can change this without a format change in the future.

ojhunt (Author):

With a single level of intermediate dirs this makes a lot of sense.



ojhunt commented Aug 27, 2025

This requires a CAS format version bump, which you can accomplish by updating FilePrefix at the top of OnDiskGraphDB.cpp. It will probably require some test changes to match.

Ah, I was trying to find where that was :D


ojhunt commented Aug 27, 2025

This requires a CAS format version bump, which you can accomplish by updating FilePrefix at the top of OnDiskGraphDB.cpp. It will probably require some test changes to match.

oh, that's right - I was thinking a change this big might warrant changing the top-level v1.1 folder to v1.2 or similar?

@cachemeifyoucan

oh, that's right - I was thinking a change this big might warrant changing the top-level v1.1 folder to v1.2 or similar?

You can only bump to v2. That number is in UnifiedOnDiskCache.cpp. I think bumping v9 -> v10 is good enough.

@benlangmuir

suggested just using IndexProxy.Hash

I forgot this was available. Yes, this makes sense to me.

I think bumping v9 -> v10 is good enough.

Agreed, there is no reason to bump the outer directory to v2, just the inner one to 10.

* Remove the support for multiple intermediate directory levels
* Reorder renaming to blindly attempt a rename first, and only
  attempt to create the directory tree if that fails, on the assumption
  that directory collisions will eventually be more likely than not.
@ojhunt ojhunt force-pushed the users/ojhunt/cas-fanout-persistent-storage branch from 884ca39 to 51e719e Compare September 4, 2025 19:51