Skip to content

[llvm][cas] Prevent corruption on ENOSPC on sparse filesystems #10870

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: next
Choose a base branch
from

Conversation

benlangmuir
Copy link

If a platform and filesystem do not detect disk out of space errors during page fault but instead defer it to page flush, we can corrupt the CAS data without receiving an error until it is too late. Fix that on platforms that support preallocating disk space by ensuring all CAS allocations are allocated to disk before writing. For standalone files, we ensure the entire file is allocated up front. For bump pointer allocated files (index, datapool, cache), we allocate in chunks of 1 MB to amortize the cost, and store the current disk-allocated size into the database file next to the bump pointer so that every thread and process can check their allocations.

The overhead from these additional checks is <1% in a store-heavy stress test on Darwin/APFS, and effectively zero overhead on a non-sparse filesystem Darwin/tmpfs.

rdar://152273395

@benlangmuir
Copy link
Author

@swift-ci please test llvm

@benlangmuir
Copy link
Author

@swift-ci please test llvm

@benlangmuir
Copy link
Author

@swift-ci please test llvm

// We are creating a new file; run the constructor.
if (Error E = NewFileConstructor(Result))
return std::move(E);

// Get the allocated size again in case it grew during construction.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is when file has exclusive access to allocate the file right? Why the allocated size is not initialized in initializeBumPtr and need to set it here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initializeBumPtr is called from the DatabaseFile::create which doesn't have this information; are you suggesting we pass it through to there?

FileSize = FileSizeInfo::get(File);
if (!FileSize)
return createFileError(Result.Path, FileSize.getError());
Result.H->AllocatedSize.exchange(FileSize->AllocatedSize);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AllocatedSize can change between FileSizeInfo::get and exchange.

I feel like FileSizeInfo::get() should get the atomic<>* and compare_exchange to always set the correct value (or go back to a loop). Not sure if that is going to hurt performance or not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code inside if (FileSize->Size < Capacity) { has exclusive access so the size cannot change here if everyone is locking correctly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. The new file case and the first open after truncation case is kind of mixed up here. Is it possible to refactor the code so there is only one update to AllocatedSize that covers both case together?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combined these into a single place. Also added some asserts that we have exclusive access from the file lock in the right places.

if (Error E = preallocateFileTail(*FD, DiskSize, DiskSize + Increment).moveInto(NewSize))
return std::move(E);
assert(NewSize >= DiskSize + Increment);
// FIXME: on Darwin this can under-count the size if there is a race to

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any side affect for under-count? For example, if the file capacity is 1MB and we are undercounting so we allocate pass the file capacity. Is that an error or some undefined behavior?

But maybe it is not a problem if we can also get preallocateFileTail to increment atomic directly with FileSizeInfo::get loop, with unknown perf.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any side affect for under-count? For example, if the file capacity is 1MB and we are undercounting so we allocate pass the file capacity. Is that an error or some undefined behavior?

No, under counting the size here will just mean we allocate more disk space than we need to during a future allocation (no impact on correctness, may have small effect on performance).

preallocateFileTail to increment atomic directly with FileSizeInfo::get loop

Doing a fetch_add with the size change instead of compare_and_swap of the new size theoretically fixes the issue, but I don't like making preallocateFileTail have to be aware of the atomic size counter. Doing a size check with stat() is an additional syscall, which increases overhead much more than the atomic operations, so I don't think that's worthwhile.

My reasoning for the current approach:

  • The only downside is we might make more allocations than we need if some of the allocations race to allocate disk.
  • When we close/reopen the CAS we will fix the mismatch in size, so the space won't be lost long term.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, under counting the size here will just mean we allocate more disk space than we need to during a future allocation (no impact on correctness, may have small effect on performance).

I mean under counting can make us allocate pass the end of the file size. Is that defined? Can you allocate 1.2MB when the file size is 1MB?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean under counting can make us allocate pass the end of the file size.

The file has two different sizes: the apparent file size and the allocated size on disk. In general, you can have apparent size > allocated (sparse file), or apparent size < allocated (disk allocation will be in sectors of at least 512 bytes, and maybe larger, so it's common for this to happen even without special API calls).

The apparent file size for the mapped files will always be Capacity (typically GBs) using a sparse tail if the filesystem supports sparse tails. The under-counting behaviour here will cause the allocated size on disk to be larger than what we record in the atomic counter for the disk size. So I don't think there is any problem here except that we might allocate more frequently than needed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing a fetch_add with the size change instead of compare_and_swap of the new size theoretically fixes the issue, but I don't like making preallocateFileTail have to be aware of the atomic size counter. Doing a size check with stat() is an additional syscall, which increases overhead much more than the atomic operations, so I don't think that's worthwhile.

Actually, I don't think fetch_add will fix the problem, because the process can terminate in between atomic and sys calls, no matter which goes first. I don't have better idea here right now.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The under-counting behaviour here will cause the allocated size on disk to be larger than what we record in the atomic counter for the disk size

Right, our atomic counter can be smaller than the allocated size. If that happens around the file capacity, let's say the file capacity is 1MB, allocated file size is 1MB but atomic pointer is 900KB, it will fallocate more space. I am asking is if this is legal and well behaved.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is well-behaved. On Darwin fcntl F_PREALLOCATE, this will succeed (as long as there is space) and change the on-disk size of the file, but not change the apparent size of the file.

For posix_fallocate, it will increase the apparent file size if we allocate beyond the end. On Linux there is also (non-posix) fallocate, which has an option to avoid changing the apparent size if desired, but I don't think it matters since we use Capacity not the file size to limit allocations.

Copy link

@cachemeifyoucan cachemeifyoucan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional LGTM. Just that one small comment to make code more readable (unified how allocated file size is written on open).

If a platform and filesystem do not detect disk out of space errors during page
fault but instead defer it to page flush, we can corrupt the CAS data without
receiving an error until it is too late. Fix that on platforms that
support preallocating disk space by ensuring all CAS allocations are
allocated to disk before writing. For standalone files, we ensure the
entire file is allocated up front. For bump pointer allocated files
(index, datapool, cache), we allocate in chunks of 1 MB to amortize the
cost, and store the current disk-allocated size into the database file
next to the bump pointer so that every thread and process can check
their allocations.

The overhead from these additional checks is <1% in a store-heavy stress test
on Darwin/APFS, and effectively zero overhead on a non-sparse filesystem
Darwin/tmpfs.

rdar://152273395
@benlangmuir
Copy link
Author

@swift-ci please test llvm

Copy link

@cachemeifyoucan cachemeifyoucan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@benlangmuir
Copy link
Author

@swift-ci please test llvm

1 similar comment
@benlangmuir
Copy link
Author

@swift-ci please test llvm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants