Conversation

@dbtsai
Member

@dbtsai dbtsai commented Oct 23, 2019

What changes were proposed in this pull request?

Instead of using the ZStd codec directly, we use Spark's CompressionCodec, which wraps the ZStd codec in a buffered stream to avoid the excessive overhead of JNI calls when compressing/decompressing small amounts of data.

Also, by using Spark's CompressionCodec, we can easily make it configurable in the future if needed.
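
For illustration, here is a minimal sketch of the pattern (the helper name serializeWithZstd and the byte-array plumbing are assumptions for this example, not the actual PR code; CompressionCodec.createCodec and compressedOutputStream are the Spark-internal APIs involved, and Spark's ZStd codec buffers the underlying stream so many small writes collapse into a few JNI calls):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Hypothetical helper, for illustration only.
def serializeWithZstd(conf: SparkConf, payload: AnyRef): Array[Byte] = {
  val bytesOut = new ByteArrayOutputStream()
  // Spark's zstd codec wraps the raw ZStd stream in a BufferedOutputStream,
  // so small writes are batched instead of each crossing the JNI boundary.
  val codec = CompressionCodec.createCodec(conf, "zstd")
  val objOut = new ObjectOutputStream(codec.compressedOutputStream(bytesOut))
  try objOut.writeObject(payload) finally objOut.close()
  bytesOut.toByteArray
}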

Why are the changes needed?

Faster performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

@dbtsai
Member Author

dbtsai commented Oct 23, 2019

cc @tgravescs @viirya

@dongjoon-hyun
Member

Looks good to me. I'm regenerating the result.

@dongjoon-hyun
Member

The initial result looks better than before. It's faster (2x+) and there is a size reduction, too.

@viirya
Member

viirya commented Oct 23, 2019

Looks good in general, just a few questions.

@dongjoon-hyun
Member

Hi, @dbtsai. Please review and merge the result. The result is good!

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@dbtsai
Member Author

dbtsai commented Oct 23, 2019

Thank you all for reviewing. Thanks @dongjoon-hyun for running the benchmark.

@SparkQA

SparkQA commented Oct 23, 2019

Test build #112563 has finished for PR 26235 at commit 89e2af2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 24, 2019

Test build #112564 has finished for PR 26235 at commit d0b1f4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Merged to master. Thank you, @dbtsai and @viirya .

@SparkQA

SparkQA commented Oct 24, 2019

Test build #112566 has finished for PR 26235 at commit e5acdbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val objOut = new ObjectOutputStream(out)
out.write(DIRECT)
val codec = CompressionCodec.createCodec(conf, "zstd")
Member

All the other compressions have a conf. Could we do that for this too? See the examples:

private[this] val compressBroadcast = conf.get(config.BROADCAST_COMPRESS)
// Whether to compress shuffle output that are stored
private[this] val compressShuffle = conf.get(config.SHUFFLE_COMPRESS)
// Whether to compress RDD partitions that are stored serialized
private[this] val compressRdds = conf.get(config.RDD_COMPRESS)
// Whether to compress shuffle output temporarily spilled to disk
private[this] val compressShuffleSpill = conf.get(config.SHUFFLE_SPILL_COMPRESS)
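
As a sketch of what such a conf might look like (the key name and default below are assumptions patterned on Spark's ConfigBuilder; the real change was tracked separately, see the JIRA created later in this thread):

import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical config entry, illustrative only.
private[spark] val SHUFFLE_MAP_STATUS_COMPRESSION_CODEC =
  ConfigBuilder("spark.shuffle.mapStatus.compression.codec")
    .doc("The codec used to compress MapStatus, which is generated by map tasks.")
    .stringConf
    .createWithDefault("zstd")

// The hard-coded lookup would then become:
// val codec = CompressionCodec.createCodec(conf, conf.get(SHUFFLE_MAP_STATUS_COMPRESSION_CODEC))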

Contributor

I brought this up in the other PR; please see the discussion there: #26085

If you think it's needed now, then we should file a JIRA for it.

Member

Created a JIRA: https://issues.apache.org/jira/browse/SPARK-29939. @Ngone51, could you submit a PR to fix it?

Member

Sure :) @gatorsmile

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…OutputStatus

Instead of using the ZStd codec directly, we use Spark's CompressionCodec, which wraps the ZStd codec in a buffered stream to avoid the excessive overhead of JNI calls when compressing/decompressing small amounts of data.

Also, by using Spark's CompressionCodec, we can easily make it configurable in the future if needed.

Faster performance.

No.

Existing tests.

Closes apache#26235 from dbtsai/optimizeDeser.

Lead-authored-by: DB Tsai <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

Ref: LIHADOOP-56788
