-
Notifications
You must be signed in to change notification settings - Fork 341
[llvm][cas] Prevent corruption on ENOSPC on sparse filesystems #10870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: next
Are you sure you want to change the base?
Conversation
@swift-ci please test llvm |
9b08596
to
c7d380b
Compare
@swift-ci please test llvm |
c7d380b
to
0c4fdde
Compare
@swift-ci please test llvm |
// We are creating a new file; run the constructor. | ||
if (Error E = NewFileConstructor(Result)) | ||
return std::move(E); | ||
|
||
// Get the allocated size again in case it grew during construction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is when file has exclusive access to allocate the file right? Why the allocated size is not initialized in initializeBumPtr
and need to set it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initializeBumPtr
is called from the DatabaseFile::create
which doesn't have this information; are you suggesting we pass it through to there?
FileSize = FileSizeInfo::get(File); | ||
if (!FileSize) | ||
return createFileError(Result.Path, FileSize.getError()); | ||
Result.H->AllocatedSize.exchange(FileSize->AllocatedSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AllocatedSize can change between FileSizeInfo::get
and exchange
.
I feel like FileSizeInfo::get()
should get the atomic<>*
and compare_exchange to always set the correct value (or go back to a loop). Not sure if that is going to hurt performance or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code inside if (FileSize->Size < Capacity) {
has exclusive access so the size cannot change here if everyone is locking correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. The new file case and the first open after truncation case is kind of mixed up here. Is it possible to refactor the code so there is only one update to AllocatedSize
that covers both case together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combined these into a single place. Also added some asserts that we have exclusive access from the file lock in the right places.
if (Error E = preallocateFileTail(*FD, DiskSize, DiskSize + Increment).moveInto(NewSize)) | ||
return std::move(E); | ||
assert(NewSize >= DiskSize + Increment); | ||
// FIXME: on Darwin this can under-count the size if there is a race to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any side affect for under-count? For example, if the file capacity is 1MB and we are undercounting so we allocate pass the file capacity. Is that an error or some undefined behavior?
But maybe it is not a problem if we can also get preallocateFileTail
to increment atomic directly with FileSizeInfo::get
loop, with unknown perf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any side affect for under-count? For example, if the file capacity is 1MB and we are undercounting so we allocate pass the file capacity. Is that an error or some undefined behavior?
No, under counting the size here will just mean we allocate more disk space than we need to during a future allocation (no impact on correctness, may have small effect on performance).
preallocateFileTail to increment atomic directly with FileSizeInfo::get loop
Doing a fetch_add
with the size change instead of compare_and_swap
of the new size theoretically fixes the issue, but I don't like making preallocateFileTail
have to be aware of the atomic size counter. Doing a size check with stat()
is an additional syscall, which increases overhead much more than the atomic operations, so I don't think that's worthwhile.
My reasoning for the current approach:
- The only downside is we might make more allocations than we need if some of the allocations race to allocate disk.
- When we close/reopen the CAS we will fix the mismatch in size, so the space won't be lost long term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, under counting the size here will just mean we allocate more disk space than we need to during a future allocation (no impact on correctness, may have small effect on performance).
I mean under counting can make us allocate pass the end of the file size. Is that defined? Can you allocate 1.2MB when the file size is 1MB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean under counting can make us allocate pass the end of the file size.
The file has two different sizes: the apparent file size and the allocated size on disk. In general, you can have apparent size > allocated (sparse file), or apparent size < allocated (disk allocation will be in sectors of at least 512 bytes, and maybe larger, so it's common for this to happen even without special API calls).
The apparent file size for the mapped files will always be Capacity (typically GBs) using a sparse tail if the filesystem supports sparse tails. The under-counting behaviour here will cause the allocated size on disk to be larger than what we record in the atomic counter for the disk size. So I don't think there is any problem here except that we might allocate more frequently than needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing a fetch_add with the size change instead of compare_and_swap of the new size theoretically fixes the issue, but I don't like making preallocateFileTail have to be aware of the atomic size counter. Doing a size check with stat() is an additional syscall, which increases overhead much more than the atomic operations, so I don't think that's worthwhile.
Actually, I don't think fetch_add
will fix the problem, because the process can terminate in between atomic and sys calls, no matter which goes first. I don't have better idea here right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The under-counting behaviour here will cause the allocated size on disk to be larger than what we record in the atomic counter for the disk size
Right, our atomic counter can be smaller than the allocated size. If that happens around the file capacity, let's say the file capacity is 1MB, allocated file size is 1MB but atomic pointer is 900KB, it will fallocate
more space. I am asking is if this is legal and well behaved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is well-behaved. On Darwin fcntl F_PREALLOCATE, this will succeed (as long as there is space) and change the on-disk size of the file, but not change the apparent size of the file.
For posix_fallocate, it will increase the apparent file size if we allocate beyond the end. On Linux there is also (non-posix) fallocate, which has an option to avoid changing the apparent size if desired, but I don't think it matters since we use Capacity not the file size to limit allocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functional LGTM. Just that one small comment to make code more readable (unified how allocated file size is written on open).
If a platform and filesystem do not detect disk out of space errors during page fault but instead defer it to page flush, we can corrupt the CAS data without receiving an error until it is too late. Fix that on platforms that support preallocating disk space by ensuring all CAS allocations are allocated to disk before writing. For standalone files, we ensure the entire file is allocated up front. For bump pointer allocated files (index, datapool, cache), we allocate in chunks of 1 MB to amortize the cost, and store the current disk-allocated size into the database file next to the bump pointer so that every thread and process can check their allocations. The overhead from these additional checks is <1% in a store-heavy stress test on Darwin/APFS, and effectively zero overhead on a non-sparse filesystem Darwin/tmpfs. rdar://152273395
0c4fdde
to
cca6e01
Compare
@swift-ci please test llvm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@swift-ci please test llvm |
1 similar comment
@swift-ci please test llvm |
If a platform and filesystem do not detect disk out of space errors during page fault but instead defer it to page flush, we can corrupt the CAS data without receiving an error until it is too late. Fix that on platforms that support preallocating disk space by ensuring all CAS allocations are allocated to disk before writing. For standalone files, we ensure the entire file is allocated up front. For bump pointer allocated files (index, datapool, cache), we allocate in chunks of 1 MB to amortize the cost, and store the current disk-allocated size into the database file next to the bump pointer so that every thread and process can check their allocations.
The overhead from these additional checks is <1% in a store-heavy stress test on Darwin/APFS, and effectively zero overhead on a non-sparse filesystem Darwin/tmpfs.
rdar://152273395