[SPARK-24160] ShuffleBlockFetcherIterator should fail if it receives zero-size blocks #21219
Conversation
Test build #90077 has finished for PR 21219 at commit

jenkins retest this please

Test build #90093 has finished for PR 21219 at commit

jenkins retest this please

Test build #90112 has finished for PR 21219 at commit
// There is not an explicit test for SortShuffleWriter but the underlying APIs that it
// uses are shared by the UnsafeShuffleWriter (both writers use DiskBlockObjectWriter
// which returns a zero-size from commitAndGet() in case the no records were written
Seems a typo the no btw.
Fixed.
Test build #90161 has finished for PR 21219 at commit
jiangxb1987 left a comment
LGTM, cc @cloud-fan
thanks, merging to master!
What changes were proposed in this pull request?
This patch modifies `ShuffleBlockFetcherIterator` so that the receipt of zero-size blocks is treated as an error. This is done as a preventative measure to guard against a potential source of data-loss bugs.

In the shuffle layer, we guarantee that zero-size blocks are never requested: a block containing zero records is always 0 bytes in size and is marked as empty, so it will never be legitimately requested by executors. However, the existing shuffle-read path does not fully take advantage of this invariant: it never explicitly checks that fetched blocks are non-zero-size. A rough illustration of the invariant is sketched below.
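Here is a hedged, self-contained sketch of that invariant; `BlockInfo` and `blocksToFetch` are illustrative stand-ins, not Spark's actual internals:

```scala
// Hedged illustration of the shuffle-layer invariant described above:
// when fetch requests are built, only blocks with a non-zero reported
// size are requested. Names here are illustrative only.
case class BlockInfo(blockId: String, size: Long)

def blocksToFetch(statuses: Seq[BlockInfo]): Seq[BlockInfo] =
  statuses.filter(_.size > 0) // empty blocks are never legitimately requested

val statuses = Seq(
  BlockInfo("shuffle_0_0_0", 128L),
  BlockInfo("shuffle_0_1_0", 0L)) // marked empty; must never be fetched
assert(blocksToFetch(statuses).map(_.blockId) == Seq("shuffle_0_0_0"))
```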
Additionally, our decompression and deserialization streams treat zero-size inputs as empty streams rather than as errors: EOF may be interpreted as "end of stream" in certain layers (longstanding behavior dating back to the earliest versions of Spark), and decompressors like Snappy may be tolerant of zero-size inputs.

As a result, if some other bug causes legitimate buffers to be replaced with zero-size buffers (due to corruption on either the send or receive side), this would translate into silent data loss rather than an explicit fail-fast error. The sketch below shows this failure mode.
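A minimal demonstration with plain JDK streams, assuming a downstream record loop that stops at EOF:

```scala
import java.io.{ByteArrayInputStream, InputStream}

// A zero-size buffer wrapped as a stream signals EOF on the very first
// read, which downstream layers interpret as "no records" rather than
// as an error.
val empty: InputStream = new ByteArrayInputStream(Array.emptyByteArray)
assert(empty.read() == -1) // immediate end-of-stream, no exception

// So if a corrupted block arrives as zero bytes, a record-reading loop
// that stops at EOF silently yields zero records instead of failing.
```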
This patch addresses this problem by adding a `buf.size != 0` check. See code comments for pointers to tests which guarantee the invariants relied on here. A simplified sketch of the check follows.
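The snippet below is a simplified stand-in for the check, not the actual patch; the real code lives inside `ShuffleBlockFetcherIterator` and raises a fetch-failed exception with more context:

```scala
import java.io.IOException

// Illustrative helper only; name and signature are assumptions.
def checkNonEmpty(blockId: String, address: String, bufSize: Long): Unit = {
  if (bufSize == 0) {
    // A requested block must be non-empty (see the invariant above), so a
    // zero-size buffer signals corruption and should fail fast.
    throw new IOException(
      s"Received a zero-size buffer for block $blockId from $address")
  }
}
```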
How was this patch tested?

Existing tests (which required modifications, since some were creating empty buffers in mocks). I also added a test to make sure we fail on zero-size blocks; its shape is sketched below.
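A hedged sketch of the new test's shape (not the actual suite), reusing the illustrative `checkNonEmpty` helper from above:

```scala
import java.io.IOException

// Non-empty buffers pass through unchanged.
checkNonEmpty("shuffle_0_0_0", "host-a:7337", bufSize = 128L)

// Zero-size buffers must fail fast.
val failed =
  try { checkNonEmpty("shuffle_0_1_0", "host-a:7337", bufSize = 0L); false }
  catch { case _: IOException => true }
assert(failed, "expected a zero-size block to fail the fetch")
```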
To test that zero-size blocks are indeed a potential corruption source, I manually ran a workload in `spark-shell` with a modified build which replaces all buffers with zero-size buffers in the receive path.