[llvm][cas] Prevent corruption on ENOSPC on sparse filesystems #10870

benlangmuir · 2025-06-18T18:29:00Z

If a platform and filesystem do not detect out-of-disk-space errors at page fault time but instead defer them to page flush, we can corrupt the CAS data without receiving an error until it is too late. Fix that on platforms that support preallocating disk space by ensuring all CAS allocations are backed by disk before writing. For standalone files, we ensure the entire file is allocated up front. For bump-pointer-allocated files (index, datapool, cache), we allocate in chunks of 1 MB to amortize the cost, and store the current disk-allocated size in the database file next to the bump pointer so that every thread and process can check its allocations.
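To make the chunked scheme concrete, here is a minimal sketch of a bump-pointer allocation that checks a shared disk-allocated size before handing out memory. It is not the code in this PR: `Header`, `PreallocateFn`, and `allocate` are hypothetical names, and the preallocation step is a callback so the example stays filesystem-independent.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical stand-in for the header kept inside the mapped database file.
struct Header {
  std::atomic<int64_t> BumpPtr{0};       // Next free offset.
  std::atomic<int64_t> AllocatedSize{0}; // Bytes known to be backed by disk.
};

// Hypothetical callback that asks the filesystem to back [OldEnd, NewEnd)
// with real blocks. Returns false when the disk is full.
using PreallocateFn = std::function<bool(int64_t OldEnd, int64_t NewEnd)>;

constexpr int64_t ChunkSize = int64_t(1) << 20; // 1 MB, as in the description.

// Reserves Size bytes and returns their offset, or std::nullopt when the
// region is full or the disk space cannot be reserved.
std::optional<int64_t> allocate(Header &H, int64_t Size, int64_t Capacity,
                                const PreallocateFn &Preallocate) {
  assert(Size > 0);
  // Bump the pointer, refusing to go past Capacity.
  int64_t Offset = H.BumpPtr.load();
  do {
    if (Offset + Size > Capacity)
      return std::nullopt;
  } while (!H.BumpPtr.compare_exchange_weak(Offset, Offset + Size));

  int64_t End = Offset + Size;
  int64_t Disk = H.AllocatedSize.load();
  if (End > Disk) {
    // Round up to a whole chunk so that most allocations skip the syscall.
    int64_t NewDisk =
        std::min(Capacity, (End + ChunkSize - 1) / ChunkSize * ChunkSize);
    if (!Preallocate(Disk, NewDisk))
      return std::nullopt; // This sketch leaks the bumped range on failure.
    // Publish the larger size unless another thread already recorded more.
    while (Disk < NewDisk &&
           !H.AllocatedSize.compare_exchange_weak(Disk, NewDisk)) {
    }
  }
  return Offset;
}
```

With a 4 MB capacity, the first 100-byte allocation triggers one preallocation up to 1 MB, and later small allocations only pay for two atomic operations until the next chunk boundary is crossed.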

The overhead from these additional checks is <1% in a store-heavy stress test on Darwin/APFS, and effectively zero on a non-sparse filesystem (Darwin/tmpfs).

rdar://152273395

benlangmuir · 2025-06-18T18:29:08Z

@swift-ci please test llvm

benlangmuir · 2025-06-18T19:43:04Z

@swift-ci please test llvm

benlangmuir · 2025-06-23T22:19:49Z

@swift-ci please test llvm

cachemeifyoucan · 2025-06-24T17:10:08Z

llvm/lib/CAS/MappedFileRegionBumpPtr.cpp

    // We are creating a new file; run the constructor.
    if (Error E = NewFileConstructor(Result))
      return std::move(E);
+
+    // Get the allocated size again in case it grew during construction.


This is where we have exclusive access to allocate the file, right? Why is the allocated size not initialized in initializeBumPtr, so that it needs to be set here?

initializeBumPtr is called from DatabaseFile::create, which doesn't have this information; are you suggesting we pass it through to there?

cachemeifyoucan · 2025-06-24T17:14:52Z

llvm/lib/CAS/MappedFileRegionBumpPtr.cpp

+      FileSize = FileSizeInfo::get(File);
+      if (!FileSize)
+        return createFileError(Result.Path, FileSize.getError());
+      Result.H->AllocatedSize.exchange(FileSize->AllocatedSize);


AllocatedSize can change between FileSizeInfo::get and exchange.

I feel like FileSizeInfo::get() should get the atomic<>* and compare_exchange to always set the correct value (or go back to a loop). Not sure if that is going to hurt performance or not.

The code inside if (FileSize->Size < Capacity) { has exclusive access so the size cannot change here if everyone is locking correctly.

I see. The new-file case and the first-open-after-truncation case are kind of mixed up here. Is it possible to refactor the code so that a single update to AllocatedSize covers both cases?

Combined these into a single place. Also added some asserts that we have exclusive access from the file lock in the right places.

cachemeifyoucan · 2025-06-24T17:18:25Z

llvm/lib/CAS/MappedFileRegionBumpPtr.cpp

+    if (Error E = preallocateFileTail(*FD, DiskSize, DiskSize + Increment).moveInto(NewSize))
+      return std::move(E);
+    assert(NewSize >= DiskSize + Increment);
+    // FIXME: on Darwin this can under-count the size if there is a race to


Is there any side effect of under-counting? For example, if the file capacity is 1MB and we under-count, we could allocate past the file capacity. Is that an error or undefined behavior?

But maybe it is not a problem if we can also have preallocateFileTail increment the atomic directly in a FileSizeInfo::get loop, with unknown performance impact.

> Is there any side effect of under-counting? For example, if the file capacity is 1MB and we under-count, we could allocate past the file capacity. Is that an error or undefined behavior?

No, under-counting the size here will just mean we allocate more disk space than we need to during a future allocation (no impact on correctness, may have a small effect on performance).

> have preallocateFileTail increment the atomic directly in a FileSizeInfo::get loop

Doing a fetch_add with the size change instead of compare_and_swap of the new size theoretically fixes the issue, but I don't like making preallocateFileTail have to be aware of the atomic size counter. Doing a size check with stat() is an additional syscall, which increases overhead much more than the atomic operations, so I don't think that's worthwhile.

My reasoning for the current approach:

- The only downside is that we might make more allocations than we need if some of the allocations race to allocate disk.
- When we close/reopen the CAS we will fix the mismatch in size, so the space won't be lost long term.
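The trade-off between compare-and-swap and fetch_add can be illustrated with a small single-threaded model of one assumed interleaving: two racers read the same recorded size, both extend the file tail, then both publish. Everything here (`FakeDisk`, `Outcome`, the two functions) is hypothetical and only models the race; it does not call any filesystem API and is not taken from the PR.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr int64_t Increment = int64_t(1) << 20;

// Models a file whose tail preallocation always extends the physical end,
// regardless of what the caller believes the current end is (an assumption).
struct FakeDisk {
  int64_t Allocated = 0;
  void extendTail(int64_t Bytes) { Allocated += Bytes; }
};

struct Outcome {
  int64_t OnDisk;   // What the disk really holds.
  int64_t Recorded; // What the shared atomic counter says.
};

// Both racers publish "the size I saw + Increment" with compare-and-swap.
Outcome raceWithCompareExchange() {
  FakeDisk Disk;
  std::atomic<int64_t> Recorded{0};
  int64_t SeenA = Recorded.load();
  int64_t SeenB = Recorded.load(); // B reads before A publishes.
  Disk.extendTail(Increment);      // A preallocates.
  Disk.extendTail(Increment);      // B preallocates as well.
  int64_t ExpectedA = SeenA;
  Recorded.compare_exchange_strong(ExpectedA, SeenA + Increment); // Succeeds.
  int64_t ExpectedB = SeenB;
  Recorded.compare_exchange_strong(ExpectedB, SeenB + Increment); // Fails.
  return {Disk.Allocated, Recorded.load()};
}

// Both racers publish the delta they added instead.
Outcome raceWithFetchAdd() {
  FakeDisk Disk;
  std::atomic<int64_t> Recorded{0};
  Disk.extendTail(Increment);
  Disk.extendTail(Increment);
  Recorded.fetch_add(Increment);
  Recorded.fetch_add(Increment);
  return {Disk.Allocated, Recorded.load()};
}
```

In this model the compare-and-swap variant records 1 MB while 2 MB are on disk, so the counter is conservative: a later allocation may preallocate again, but it never writes to space the counter claims is backed when it is not.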

> No, under-counting the size here will just mean we allocate more disk space than we need to during a future allocation (no impact on correctness, may have a small effect on performance).

I mean under-counting can make us allocate past the end of the file size. Is that defined? Can you allocate 1.2MB when the file size is 1MB?

> I mean under-counting can make us allocate past the end of the file size.

The file has two different sizes: the apparent file size and the allocated size on disk. In general, you can have apparent size > allocated (sparse file), or apparent size < allocated (disk allocation will be in sectors of at least 512 bytes, and maybe larger, so it's common for this to happen even without special API calls).

The apparent file size for the mapped files will always be Capacity (typically GBs) using a sparse tail if the filesystem supports sparse tails. The under-counting behaviour here will cause the allocated size on disk to be larger than what we record in the atomic counter for the disk size. So I don't think there is any problem here except that we might allocate more frequently than needed.
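For readers unfamiliar with the two sizes, the following sketch reads both from `fstat` on a POSIX system. `FileSizes`, `getFileSizes`, and `makeSparseFile` are hypothetical names for illustration (the PR has its own `FileSizeInfo`), and the 512-byte unit for `st_blocks` is an assumption that holds on Linux and Darwin but is not guaranteed by POSIX.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <sys/stat.h>
#include <unistd.h>

struct FileSizes {
  int64_t Apparent; // st_size: what read() and mmap() see.
  int64_t OnDisk;   // Blocks actually reserved by the filesystem.
};

std::optional<FileSizes> getFileSizes(int FD) {
  struct stat St;
  if (fstat(FD, &St) != 0)
    return std::nullopt;
  // Assumes st_blocks counts 512-byte units (true on Linux and Darwin).
  return FileSizes{static_cast<int64_t>(St.st_size),
                   static_cast<int64_t>(St.st_blocks) * 512};
}

// Creates a temporary file whose apparent size is Capacity without writing
// any data. On filesystems with sparse-file support the tail takes little
// or no disk space.
std::optional<FileSizes> makeSparseFile(int64_t Capacity) {
  char Path[] = "/tmp/sparse-demo-XXXXXX";
  int FD = mkstemp(Path);
  if (FD < 0)
    return std::nullopt;
  unlink(Path); // The file disappears once FD is closed.
  std::optional<FileSizes> Sizes;
  if (ftruncate(FD, Capacity) == 0)
    Sizes = getFileSizes(FD);
  close(FD);
  return Sizes;
}
```

Calling `makeSparseFile(64 << 20)` on a sparse-capable filesystem typically reports an apparent size of 64 MB with an on-disk size close to zero, which is the gap the preallocation in this PR closes before any write happens.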

> Doing a fetch_add with the size change instead of compare_and_swap of the new size theoretically fixes the issue, but I don't like making preallocateFileTail have to be aware of the atomic size counter. Doing a size check with stat() is an additional syscall, which increases overhead much more than the atomic operations, so I don't think that's worthwhile.

Actually, I don't think fetch_add will fix the problem, because the process can terminate between the atomic operation and the syscall, no matter which goes first. I don't have a better idea here right now.

> The under-counting behaviour here will cause the allocated size on disk to be larger than what we record in the atomic counter for the disk size

Right, our atomic counter can be smaller than the allocated size. If that happens near the file capacity, say the file capacity is 1MB, the allocated file size is 1MB, but the atomic counter is 900KB, it will fallocate more space. I am asking whether this is legal and well-behaved.

Yes, this is well-behaved. With fcntl F_PREALLOCATE on Darwin, this will succeed (as long as there is space) and change the on-disk size of the file, but not change the apparent size of the file.

For posix_fallocate, it will increase the apparent file size if we allocate beyond the end. On Linux there is also (non-posix) fallocate, which has an option to avoid changing the apparent size if desired, but I don't think it matters since we use Capacity not the file size to limit allocations.
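The three mechanisms mentioned here can be put side by side in a small wrapper. This is a sketch under assumptions, not the PR's preallocateFileTail: `preallocateTail`, `apparentSize`, and `PreallocationGrowsFile` are hypothetical names, and the Darwin branch assumes `F_PEOFPOSMODE` extends from the physical end of the file.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // For fallocate() on Linux.
#endif
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#if defined(__APPLE__)
constexpr bool PreallocationGrowsFile = false; // Apparent size is unchanged.
#else
constexpr bool PreallocationGrowsFile = true; // posix_fallocate extends it.
#endif

// Backs the bytes in [CurrentEnd, NewEnd) with disk blocks. Returns 0 on
// success or an errno value such as ENOSPC. KeepApparentSize is only
// honoured on Linux in this sketch.
int preallocateTail(int FD, int64_t CurrentEnd, int64_t NewEnd,
                    bool KeepApparentSize = false) {
  assert(NewEnd > CurrentEnd);
  int64_t Length = NewEnd - CurrentEnd;
#if defined(__APPLE__)
  (void)KeepApparentSize;
  fstore_t Store = {};
  Store.fst_flags = F_ALLOCATEALL;   // All or nothing.
  Store.fst_posmode = F_PEOFPOSMODE; // Extend from the physical end of file.
  Store.fst_offset = 0;
  Store.fst_length = Length;
  return fcntl(FD, F_PREALLOCATE, &Store) == -1 ? errno : 0;
#else
#if defined(__linux__)
  if (KeepApparentSize)
    return fallocate(FD, FALLOC_FL_KEEP_SIZE, CurrentEnd, Length) == 0 ? 0
                                                                       : errno;
#else
  (void)KeepApparentSize;
#endif
  // posix_fallocate returns the error number directly and does not set errno.
  return posix_fallocate(FD, CurrentEnd, Length);
#endif
}

// Returns st_size for FD, or -1 if fstat fails.
int64_t apparentSize(int FD) {
  struct stat St;
  return fstat(FD, &St) == 0 ? static_cast<int64_t>(St.st_size) : -1;
}
```

Because allocations are bounded by Capacity rather than by the apparent file size, any of the three branches serves the purpose described above; the Linux-only `KeepApparentSize` flag is shown just to illustrate the option mentioned in the comment.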

cachemeifyoucan

Functionally LGTM. Just one small comment to make the code more readable (unify how the allocated file size is written on open).

benlangmuir · 2025-06-25T16:51:33Z

@swift-ci please test llvm

cachemeifyoucan

LGTM

benlangmuir · 2025-06-25T20:16:32Z

@swift-ci please test llvm

benlangmuir · 2025-06-26T16:08:19Z

@swift-ci please test llvm

benlangmuir force-pushed the enospc-apfs-2 branch from 9b08596 to c7d380b Compare June 18, 2025 19:30

benlangmuir force-pushed the enospc-apfs-2 branch from c7d380b to 0c4fdde Compare June 23, 2025 22:19

benlangmuir requested review from cachemeifyoucan and akyrtzi June 24, 2025 16:05

cachemeifyoucan reviewed Jun 24, 2025

cachemeifyoucan approved these changes Jun 24, 2025

benlangmuir force-pushed the enospc-apfs-2 branch from 0c4fdde to cca6e01 Compare June 25, 2025 16:50

cachemeifyoucan approved these changes Jun 25, 2025
