Conversation

@dbtsai
Member

@dbtsai dbtsai commented Oct 23, 2019

What changes were proposed in this pull request?

Instead of using the ZStd codec directly, we use Spark's CompressionCodec, which wraps the ZStd codec in a buffered stream to avoid the excessive overhead of JNI calls when compressing/decompressing small amounts of data.

Also, by using Spark's CompressionCodec, we can easily make it configurable in the future if needed.
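
For illustration, here is a minimal sketch of the pattern (the helper name serializeWithZstd and the byte-array plumbing are assumptions for this example, not the actual PR code; CompressionCodec.createCodec and compressedOutputStream are the Spark-internal APIs involved, and Spark's ZStd codec buffers the underlying stream so many small writes collapse into a few JNI calls):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Hypothetical helper, for illustration only.
def serializeWithZstd(conf: SparkConf, payload: AnyRef): Array[Byte] = {
  val bytesOut = new ByteArrayOutputStream()
  // Spark's zstd codec wraps the raw ZStd stream in a BufferedOutputStream,
  // so small writes are batched instead of each crossing the JNI boundary.
  val codec = CompressionCodec.createCodec(conf, "zstd")
  val objOut = new ObjectOutputStream(codec.compressedOutputStream(bytesOut))
  try objOut.writeObject(payload) finally objOut.close()
  bytesOut.toByteArray
}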

Why are the changes needed?

Faster performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

@dbtsai
Member Author

dbtsai commented Oct 23, 2019

cc @tgravescs @viirya

@dongjoon-hyun
Member

Looks good to me. I'm regenerating the result.

@dongjoon-hyun
Member

The initial result looks better than before. It's faster (2x+) and there is a size reduction, too.

@viirya
Member

viirya commented Oct 23, 2019

Looks good in general, just a few questions.

@dongjoon-hyun
Member

Hi, @dbtsai. Please review and merge the result. The result is good!

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@dbtsai
Member Author

dbtsai commented Oct 23, 2019

Thank you all for reviewing. Thanks @dongjoon-hyun for running the benchmark.

@SparkQA

SparkQA commented Oct 23, 2019

Test build #112563 has finished for PR 26235 at commit 89e2af2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 24, 2019

Test build #112564 has finished for PR 26235 at commit d0b1f4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Merged to master. Thank you, @dbtsai and @viirya .

@SparkQA

SparkQA commented Oct 24, 2019

Test build #112566 has finished for PR 26235 at commit e5acdbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val objOut = new ObjectOutputStream(out)
out.write(DIRECT)
val codec = CompressionCodec.createCodec(conf, "zstd")
Member

All the other compressions have a conf. Could we do that for this too? See the examples:

private[this] val compressBroadcast = conf.get(config.BROADCAST_COMPRESS)
// Whether to compress shuffle output that are stored
private[this] val compressShuffle = conf.get(config.SHUFFLE_COMPRESS)
// Whether to compress RDD partitions that are stored serialized
private[this] val compressRdds = conf.get(config.RDD_COMPRESS)
// Whether to compress shuffle output temporarily spilled to disk
private[this] val compressShuffleSpill = conf.get(config.SHUFFLE_SPILL_COMPRESS)
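
As a sketch of what such a conf might look like (the key name and default below are assumptions patterned on Spark's ConfigBuilder; the real change was tracked separately, see the JIRA created later in this thread):

import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical config entry, illustrative only.
private[spark] val SHUFFLE_MAP_STATUS_COMPRESSION_CODEC =
  ConfigBuilder("spark.shuffle.mapStatus.compression.codec")
    .doc("The codec used to compress MapStatus, which is generated by map tasks.")
    .stringConf
    .createWithDefault("zstd")

// The hard-coded lookup would then become:
// val codec = CompressionCodec.createCodec(conf, conf.get(SHUFFLE_MAP_STATUS_COMPRESSION_CODEC))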

Contributor

I brought this up in the other PR; please see the discussion there: #26085

If you think it's needed now, then we should file a JIRA for it.

Member

Created a JIRA: https://issues.apache.org/jira/browse/SPARK-29939. @Ngone51, could you submit a PR to fix it?

Member

Sure :) @gatorsmile

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…OutputStatus

Instead of using the ZStd codec directly, we use Spark's CompressionCodec, which wraps the ZStd codec in a buffered stream to avoid the excessive overhead of JNI calls when compressing/decompressing small amounts of data.

Also, by using Spark's CompressionCodec, we can easily make it configurable in the future if needed.

Faster performance.

No.

Existing tests.

Closes apache#26235 from dbtsai/optimizeDeser.

Lead-authored-by: DB Tsai <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

Ref: LIHADOOP-56788
