
Conversation

@original-brownbear
Contributor

Removing two pieces of the BlobContainer API here that don't apply well to all implementations and needlessly complicate the code:

  • Having both write and writeAtomic makes no sense. On Cloud providers all writes are atomic, and on FS and HDFS the added cost of a move and a directory fsync is hardly relevant (a rough sketch of that filesystem path follows after this list).
    • On the other hand, allowing for partial writes (though extremely unlikely in practice) simply has the potential of creating more garbage blobs.
  • Removing the fail-if-already-exists flag because it does not work on the most widely used implementation (S3) anyway.
    • It simply makes no sense to have this flag. Either it is relevant for the stable operation of the repository plugin, in which case S3 is broken by definition, or it isn't, in which case it's dead code.
    • The actual resiliency improvement from this flag is minor since we use UUIDs for almost all the blobs now anyway. The flag is in fact causing a bug (see SNAPSHOT: More Resilient Writes to Blob Stores #36927 for a detailed explanation).
  • closes Master failover during snapshotting could leave the snapshot incomplete #25281
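
For context on the first bullet, this is roughly what the atomic FS write path amounts to: stream to a temp file, fsync it, rename it into place, and fsync the parent directory. A hedged sketch only, not the actual Elasticsearch implementation; the temp-name prefix and method name are made up for illustration.

```java
// A minimal sketch of an atomic filesystem write, assuming POSIX semantics.
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

final class AtomicFsWriteSketch {

    static void writeBlobAtomic(Path containerDir, String blobName, InputStream data) throws IOException {
        Path tempBlob = containerDir.resolve("pending-" + blobName);
        Path targetBlob = containerDir.resolve(blobName);
        // Stream the data fully into a temporary file and force it to disk.
        Files.copy(data, tempBlob, StandardCopyOption.REPLACE_EXISTING);
        try (FileChannel channel = FileChannel.open(tempBlob, StandardOpenOption.WRITE)) {
            channel.force(true);
        }
        // The extra cost over a plain write is this atomic rename ...
        Files.move(tempBlob, targetBlob, StandardCopyOption.ATOMIC_MOVE);
        // ... plus an fsync of the parent directory so the rename itself is durable
        // (opening a directory for read works on Linux/macOS, not on Windows).
        try (FileChannel dir = FileChannel.open(containerDir, StandardOpenOption.READ)) {
            dir.force(true);
        }
    }
}
```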

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

return unremoved;
}

// TODO: replace with constants class if needed (cf. org.apache.lucene.util.Constants)
Contributor Author

@original-brownbear Jun 5, 2019


See #42883, the failure described there actually comes up easily in tests here when testing concurrent access to the snapshot repository.

@tlrx
Member

tlrx commented Jun 13, 2019

Concerning the first point, I agree that using a single writeBlobAtomic makes sense but I'm a bit concerned about the potential performance impact on shared file systems like NFS. Do you have any idea? Today we only use atomic writes for the repository's index generation file and doing so for all data files might have a cost.

Concerning the second point, failing to write a blob if it already exists has been added recently and provides extra safety against concurrent snapshot finalizations, which could happen if multiple snapshots are created or deleted in quick succession. By removing this check we are now silently ignoring a potential concurrent blob modification, and I don't think we should do so.
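
To make the trade-off concrete, the following is roughly the guard that a fail-if-already-exists flag can give on a filesystem repository (a hedged sketch, not the actual Elasticsearch code; the method signature is made up). The point under debate is that S3 cannot enforce an equivalent check at write time, so the FS behavior is stronger than what the repository can rely on in general.

```java
// Illustrative only: CREATE_NEW makes the write fail with FileAlreadyExistsException
// if the blob is already present, surfacing a concurrent modification instead of
// silently overwriting it; without the flag the blob is simply replaced.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FailIfExistsWriteSketch {

    static void writeBlob(Path containerDir, String blobName, InputStream data,
                          boolean failIfAlreadyExists) throws IOException {
        Path targetBlob = containerDir.resolve(blobName);
        OpenOption[] options = failIfAlreadyExists
            ? new OpenOption[] { StandardOpenOption.WRITE, StandardOpenOption.CREATE_NEW }
            : new OpenOption[] { StandardOpenOption.WRITE, StandardOpenOption.CREATE,
                                 StandardOpenOption.TRUNCATE_EXISTING };
        // Throws java.nio.file.FileAlreadyExistsException when failIfAlreadyExists == true
        // and another writer got there first.
        try (OutputStream out = Files.newOutputStream(targetBlob, options)) {
            data.transferTo(out);
        }
    }
}
```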

@original-brownbear
Contributor Author

@tlrx

Thanks for taking a look!

I'm a bit concerned about the potential performance impact on shared file systems like NFS. Do you have any idea?

As I tried to argue above, I'd say it doesn't matter. All we're doing is adding one operation (the move) to each write. Writing the segment blobs takes longer than writing the small meta blobs in the real world (you have the request latency + the time it takes to stream the data), so at worst we're not even doubling the time per segment blob (and that would be in the weird edge case of very small segment blobs). In the real world, with 100MB+ segment blobs, I don't think this matters much (especially considering that the default write rate limit is 50MB/s :)).
=> yes, weird snapshots with a huge number of almost-empty segment files could theoretically take up to twice as long, but not more. Real-world snapshots will barely see a difference, if any (rough numbers below).
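
Rough numbers behind that claim (all of these are assumptions taken from the comment above, not measurements):

```java
// Back-of-envelope estimate: streaming a 100 MB segment blob at the 50 MB/s rate
// limit quoted above takes ~2 s, while a rename plus a directory fsync is assumed
// to be on the order of a few milliseconds, so the atomic-write overhead is noise.
public final class AtomicWriteOverheadEstimate {
    public static void main(String[] args) {
        final double blobSizeMb = 100.0;        // assumed typical segment blob size
        final double rateLimitMbPerSec = 50.0;  // write rate limit quoted in the comment
        final double moveAndFsyncSec = 0.005;   // assumed ~5 ms for move + directory fsync
        final double streamSec = blobSizeMb / rateLimitMbPerSec;
        final double overheadPct = 100.0 * moveAndFsyncSec / streamSec;
        System.out.printf("stream: %.1f s, atomic-write overhead: %.2f%%%n", streamSec, overheadPct);
    }
}
```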

Concerning the second point, failing to write a blob if it already exists has been added recently and provides extra safety against concurrent snapshot finalizations, which could happen if multiple snapshots are created or deleted in quick succession. By removing this check we are now silently ignoring a potential concurrent blob modification, and I don't think we should do so.

I would argue that this check doesn't work on S3 at all anyway, and on NFS it's not guaranteed to work either. So yes, this is an additional safety net, but we can't base the functionality on it since we don't have it in the most widely used plugin (S3). This is actually somewhat harmful even, because the majority of our tests run against the FS repository (and the test that I unmuted failed because of this check), so the tests behave "quasi safer" than the real world.
As far as the finalization goes, this isn't really an issue in the real world because a stuck master will simply write the exact same blob that the current master wrote concurrently. The only protection we get is for the theoretical case of a master that is stuck for the whole time it takes to run another snapshot. From analyzing logs for the snapshot resiliency work on Cloud I can say with confidence that this has never happened on Cloud ever (like literally never :)). Another way to think about it: the master still has the safety net (which also works on S3) of listing the index-N blobs before finalizing, so you'd have to get stuck after listing the index-N for (in the real world) 30s+ and then get unstuck and write the index-N (see the sketch below for what that listing amounts to).
And even then, all you're corrupting is the index.latest content, because the index-N would just have the old content => I don't think there is any value in this check at all anymore :)
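
For reference, the index-N safety net mentioned above boils down to something like this: list the existing index-N blobs, take the highest generation, and only write generation N+1. A hedged sketch; listBlobsByPrefix is only assumed to mirror the BlobContainer API here, and this is not the actual repository finalization code.

```java
// Illustrative helper that picks the latest index-N generation out of a blob listing;
// the caller would obtain the names from something like
// blobContainer.listBlobsByPrefix("index-").keySet() and then write index-(latest + 1).
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IndexGenerationsSketch {

    private static final Pattern INDEX_N = Pattern.compile("^index-(\\d+)$");

    static long latestGeneration(Iterable<String> blobNames) {
        long latest = -1L; // -1 means no index-N blob has been written yet
        for (String name : blobNames) {
            Matcher m = INDEX_N.matcher(name);
            if (m.matches()) {
                latest = Math.max(latest, Long.parseLong(m.group(1)));
            }
        }
        return latest;
    }
}
```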

@tlrx
Member

tlrx commented Jun 18, 2019

I think that this PR could be split into two different PRs and discussed separately.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jun 18, 2019
* Extracted from elastic#42886
* Having both write and writeAtomic makes no sense. On Cloud providers all writes are atomic and on FS and HDFS the added cost of a move and directory fsync operation is hardly relevant.
* On the other hand, allowing for partial writes (though extremely unlikely in practice) simply has the potential of creating more garbage blobs that aren't even deserializable. This change neatly ensures that every blob by a non-temp name has been fully written and fsynced (in case of NFS + HDFS).
@original-brownbear
Contributor Author

@tlrx yeah, that makes sense, sorry for mixing things up here. I extracted the removal of atomic writes to #43329; let's start there, it's probably less controversial? :)

@jpountz jpountz added v7.4.0 and removed v7.3.0 labels Jul 3, 2019
@colings86 colings86 added v7.5.0 and removed v7.4.0 labels Aug 30, 2019
@original-brownbear
Contributor Author

Closing this as it will become irrelevant as a result of #46250 and follow-ups (since we'll use the cluster state for consistency of the repository contents), and we can simply remove these methods once they become unused.
