[SPARK-24160] ShuffleBlockFetcherIterator should fail if it receives zero-size blocks #21219

JoshRosen · 2018-05-02T22:12:33Z

What changes were proposed in this pull request?

This patch modifies ShuffleBlockFetcherIterator so that the receipt of zero-size blocks is treated as an error. This is done as a preventative measure to guard against a potential source of data loss bugs.

In the shuffle layer, we guarantee that zero-size blocks will never be requested (a block containing zero records is always 0 bytes in size and is marked as empty such that it will never be legitimately requested by executors). However, the existing code does not fully take advantage of this invariant in the shuffle-read path: it does not explicitly check whether fetched blocks are non-zero-size.

Additionally, our decompression and deserialization streams treat zero-size inputs as empty streams rather than errors: EOF may be treated as "end-of-stream" in certain layers (a longstanding behavior dating to the earliest versions of Spark), and decompressors like Snappy may be tolerant of zero-size inputs.

As a result, if some other bug causes legitimate buffers to be replaced with zero-size buffers (due to corruption on either the send or receive side), this would translate into silent data loss rather than an explicit fail-fast error.

This patch addresses this problem by adding a buf.size != 0 check. See code comments for pointers to tests which guarantee the invariants relied on here.
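
A minimal sketch of the kind of fail-fast check described here. This is not the code from the patch: `FetchedBlock`, `ZeroSizeBlockException`, and `validateBlock` are hypothetical names introduced for illustration, and the real iterator works with Spark's own buffer and block ID types.

```scala
// Hypothetical stand-ins for illustration only; these are not Spark classes.
final case class FetchedBlock(blockId: String, bytes: Array[Byte])

final class ZeroSizeBlockException(blockId: String)
  extends RuntimeException(s"Received a zero-size buffer for block $blockId")

object ZeroSizeCheck {
  // Empty blocks are never requested, so a zero-size result can only mean
  // that something went wrong between the writer and the reader.
  def validateBlock(block: FetchedBlock): FetchedBlock = {
    if (block.bytes.length == 0) {
      throw new ZeroSizeBlockException(block.blockId)
    }
    block
  }
}
```

The point of throwing rather than skipping the block is that the caller sees an explicit failure and can retry or abort, instead of continuing with a partition that silently lost records.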

How was this patch tested?

Existing tests (which required modifications, since some were creating empty buffers in mocks). I also added a test to make sure we fail on zero-size blocks.

To test that zero-size blocks are indeed a potential corruption source, I manually ran a workload in spark-shell with a modified build which replaces all buffers with zero-size buffers in the receive path.
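
The failure mode can be reproduced in miniature without Spark. The sketch below is only a stand-in for that experiment, under stated assumptions: `readRecords` is a hypothetical reader that, like the stream layers described in the PR description, treats EOF as end-of-stream, and `corrupt` simulates a receive path that swaps every buffer for a zero-size one. None of these names come from Spark.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import scala.collection.mutable.ArrayBuffer

object SilentLossDemo {
  // Serialize the records as a sequence of 4-byte big-endian ints.
  def writeRecords(records: Seq[Int]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    records.foreach(r => out.writeInt(r))
    out.flush()
    bytes.toByteArray
  }

  // Hypothetical reader: keeps reading ints until the input is exhausted,
  // so an empty buffer is simply "a stream with no records", not an error.
  def readRecords(buf: Array[Byte]): List[Int] = {
    val in = new DataInputStream(new ByteArrayInputStream(buf))
    val records = ArrayBuffer.empty[Int]
    while (in.available() > 0) {
      records += in.readInt()
    }
    records.toList
  }

  // Simulated corruption: the real payload is replaced with a zero-size buffer.
  def corrupt(buf: Array[Byte]): Array[Byte] = Array.emptyByteArray
}
```

Reading `corrupt(writeRecords(Seq(1, 2, 3)))` with `readRecords` yields an empty list and no exception, which is indistinguishable from a legitimately empty input. A size check before decoding turns that case into an explicit failure.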

SparkQA · 2018-05-03T02:28:31Z

Test build #90077 has finished for PR 21219 at commit 41d06e1.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2018-05-03T04:37:02Z

jenkins retest this please

SparkQA · 2018-05-03T07:05:01Z

Test build #90093 has finished for PR 21219 at commit 41d06e1.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2018-05-03T07:43:41Z

jenkins retest this please

SparkQA · 2018-05-03T11:49:00Z

Test build #90112 has finished for PR 21219 at commit 41d06e1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-05-03T13:01:11Z

core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

+            //
+            // There is not an explicit test for SortShuffleWriter but the underlying APIs that
+            // uses are shared by the UnsafeShuffleWriter (both writers use DiskBlockObjectWriter
+            // which returns a zero-size from commitAndGet() in case the no records were written


Seems there is a typo here ("the no"), btw.

SparkQA · 2018-05-04T02:14:28Z

Test build #90161 has finished for PR 21219 at commit 3ecd7da.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987

LGTM, cc @cloud-fan

cloud-fan · 2018-05-07T06:34:14Z

thanks, merging to master!

ShuffleBlockFetcherIterator should fail if it receives zero-size blocks

41d06e1

HyukjinKwon reviewed May 3, 2018

Update ShuffleBlockFetcherIterator.scala

3ecd7da

jiangxb1987 approved these changes May 6, 2018

asfgit closed this in d2aa859 May 7, 2018

LantaoJin mentioned this pull request Mar 21, 2019

[SPARK-27216][CORE] Upgrade RoaringBitmap to 0.7.45 #24157

Closed
