[SPARK-27562][Shuffle] Complete the verification mechanism for shuffle transmitted data #28525
Conversation
cc @jerryshao

ok to test

Test build #122618 has finished for PR 28525 at commit
Sounds like a great and very useful feature.
FYI, the dependency failure will be fixed by the following.

Test build #122640 has finished for PR 28525 at commit

Test build #122641 has finished for PR 28525 at commit

Test build #122659 has finished for PR 28525 at commit

Test build #122664 has finished for PR 28525 at commit

Test build #122667 has finished for PR 28525 at commit
common/network-common/src/main/java/org/apache/spark/network/util/DigestUtils.java
.../network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexInformation.java
I would suggest describing the spec of the shuffle index file somewhere in the code, and reducing the magic hard-coded numbers scattered everywhere.
...ork-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
The current implementation uses the shuffle index file to store partition digests. I think this: 1) couples two things together and makes them hard to evolve; 2) makes the logic unintuitive. I would suggest separating the index file from the CRC file, and using a new file to store the shuffle digests.
core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
Thanks for the review. I have modified the solution and now save the digests in an independent file.
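A standalone digest file of the kind suggested here could be as simple as one fixed-width CRC32 value per reduce partition, in partition order. The following is a hypothetical sketch of such a layout, not the PR's actual code; the class and method names are made up for illustration.

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class DigestFileWriter {
    // Hypothetical layout for a standalone digest file: one 8-byte
    // CRC32 value per reduce partition, in partition order, kept
    // separate from the offsets stored in the existing index file.
    static void writeDigests(String path, long[] digests) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
            for (long digest : digests) {
                out.writeLong(digest);
            }
        }
    }
}
```

With this layout, the digest for partition i can be located by seeking to offset 8 * i, without touching the index file at all.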
Test build #122924 has finished for PR 28525 at commit

Test build #122925 has finished for PR 28525 at commit

Test build #122931 has finished for PR 28525 at commit
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
We've seen some shuffle data corruption during the shuffle read phase.
As described in SPARK-26089, Spark only checked small shuffle blocks before PR #23453, which was proposed by ankuriitg. There are two changes/improvements made in PR #23453:
1. The beginning of large blocks is also checked, so if a large block is corrupt at the start, that block will be re-fetched, and if that also fails, a FetchFailureException will be thrown.
2. IOExceptions thrown while reading the stream will be converted to FetchFailureException. This is slightly more aggressive than originally intended, but since the consumer of the stream may have already read and processed some records, we can't just re-fetch the block; we need to fail the whole task. Additionally, we also considered adding a new type of TaskEndReason that would retry the task a couple of times before failing the previous stage, but given the complexity involved in that solution we decided not to proceed in that direction.
However, I think some problems still exist with the current verification mechanism for shuffle transmitted data.
This PR completes the verification mechanism for shuffle transmitted data:
First, CRC32 is chosen as the checksum for verifying shuffle data. CRC is also used for checksum verification in Hadoop; it is simple and fast.
During the shuffle write phase, after completing the partitioned file, we compute
the CRC32 value for each partition and then write these digests, together with the offsets, into the shuffle index file.
For SortShuffleWriter and the unsafe shuffle writer, there is only one partitioned file per shuffle map task, so the digest computation (one digest per partition, derived from the offsets of this partitioned file) is cheap.
For the bypass shuffle writer, the number of reduce partitions is smaller than byPassMergeThreshold, so the cost of digest computation is acceptable.
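The write-side step described above can be sketched as follows. This is a simplified illustration, not the PR's actual code; it assumes the partitioned data is available as a byte array and that the per-partition lengths (the same ones the index file records) are known.

```java
import java.util.zip.CRC32;

public class PartitionDigests {
    // Sketch: given the bytes of a partitioned shuffle data file and
    // the per-partition lengths already tracked for the index file,
    // compute one CRC32 digest per partition. The data is scanned
    // exactly once, which is why the computation is cheap.
    static long[] digestsFor(byte[] data, long[] partitionLengths) {
        long[] digests = new long[partitionLengths.length];
        int offset = 0;
        for (int i = 0; i < partitionLengths.length; i++) {
            int len = (int) partitionLengths[i];
            CRC32 crc = new CRC32();
            crc.update(data, offset, len);
            digests[i] = crc.getValue();
            offset += len;
        }
        return digests;
    }
}
```

Each digest covers exactly the byte range [offset, offset + length) of its partition, so a reader holding the index offsets can verify any single partition independently.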
During the shuffle read phase, the digest value is passed along with the block data,
and we recompute the digest of the received data to compare it against the original digest value.
Recomputing the digest of the received data only needs an additional 2048-byte buffer for computing the CRC32 value.
After recomputing, we reset the input stream of the received data: if the stream supports mark/reset, we only need to reset it; otherwise the data comes from a FileSegmentManagedBuffer and we need to recreate it.
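The read-side check can be sketched as below. This is a hypothetical helper, not the PR's actual code; it covers only the mark/reset case (the FileSegmentManagedBuffer case recreates the stream instead), and it uses the 2048-byte scratch buffer mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class DigestVerifier {
    // Sketch of the read-side check: recompute the CRC32 of a fetched
    // block using only a 2048-byte scratch buffer, then reset the
    // stream so the consumer can read the records from the beginning.
    // Assumes in.markSupported() is true for the given stream.
    static boolean verify(InputStream in, long expectedDigest, int blockSize) throws IOException {
        in.mark(blockSize);               // remember the start of the block
        byte[] buf = new byte[2048];      // the only extra memory needed
        CRC32 crc = new CRC32();
        int n;
        while ((n = in.read(buf)) != -1) {
            crc.update(buf, 0, n);
        }
        in.reset();                       // rewind for the real consumer
        return crc.getValue() == expectedDigest;
    }
}
```

A mismatch here would indicate corruption in transit, and the block could be re-fetched or the task failed, as described above.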
So, I think this verification mechanism proposed for shuffle transmitted data is efficient and complete.
How was this patch tested?
Unit test.