[SPARK-2469] Use Snappy (instead of LZF) for default shuffle compression codec #1415
Conversation
…on codec. This reduces shuffle compression memory usage by 3x.
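For context, the codec being switched here is the one behind Spark's `spark.io.compression.codec` setting, so the old LZF behavior stays available via configuration. A minimal sketch (the short name `lzf` assumes the codec aliases discussed in SPARK-2953, which is referenced later in this thread; the fully-qualified class name also works):

```properties
# spark-defaults.conf — opt back into LZF if the Snappy default regresses a workload
spark.io.compression.codec  lzf
# equivalently, via the fully-qualified class name:
# spark.io.compression.codec  org.apache.spark.io.LZFCompressionCodec
```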
QA tests have started for PR 1415. This patch merges cleanly.
Do we want to change the default for everything, or only for shuffle? (Changing only shuffle won't impact anything outside of Spark.)
This is actually only used in shuffle.
Actually I lied. Somebody else added some code to use the compression codec to compress event data ...
It's actually bad to use these compression codecs to compress event data, because there is no guarantee they can be decompressed on different platforms ...
cc @andrewor14 I guess you added the event code ...
I looked into the event logger code, and it appears that the codec change should be fine. It figures out the codec for old data automatically anyway.
Yes, we log the codec used in a separate file so we don't lock ourselves out of our old event logs. This change seems fine.
@andrewor14 do we also log the block size, etc. of the codec used? IIRC we use the codec to compress … Other than (a) and (e), sharing data via the others would be non-trivial and something we don't need to support, IMO.
We should create a JIRA so compression streams use the first few bytes to track the compression codec and any settings it needs (for lzf/snappy/lz4, there aren't any). I think compressed blocks in Tachyon can currently be a problem, and (e) I'm less sure about. @tdas can you comment?
QA results for PR 1415:
Weird test failures; they are unrelated to this change.
Ah yes, block size is only used at compression time and is inferred from the stream during decompression.
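A stand-in illustration of this point (not Spark code; zlib from the Python standard library substitutes for snappy/lzf): settings chosen by the writer at compression time do not need to be communicated to the reader, because the compressed stream itself carries enough information to decompress.

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 256

# Writer-side choices (compression level here, block size in snappy's case)
# produce different byte streams ...
fast = zlib.compress(data, 1)   # fast, larger output
best = zlib.compress(data, 9)   # slower, smaller output

# ... but the reader needs none of those compression-time settings:
assert zlib.decompress(fast) == data
assert zlib.decompress(best) == data
```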
Yea, the test failure isn't related. If there is no objection, I'm going to merge this tomorrow. I will file a JIRA ticket so we can prepend compression codec information to compressed data, and then perhaps pick the compression codec during decompression based on that.
Can't comment on Tachyon since we don't use it and have no experience with it, unfortunately.
@rxin IIRC we changed this once before and it caused a performance regression in our perf suite, so we reverted it. At the time I think we were running on smaller data sets, though. Maybe in this case we are willing to take a hit?
Yea - stability seems much more important than a small performance gain ...
Only the codec names are stored in the event logs; no other information is currently recorded. But this change isn't really breaking anything in that area. (And, by default, event logs are not compressed.)
FYI, filed JIRA: https://issues.apache.org/jira/browse/SPARK-2496 "Compression streams should write its codec info to the stream"
Ok, I'm merging this one. Thanks guys.
…ion codec. This reduces shuffle compression memory usage by 3x. Author: Reynold Xin <[email protected]> Closes apache#1415 from rxin/snappy and squashes the following commits: 06c1a01 [Reynold Xin] SPARK-2469: Use Snappy (instead of LZF) for default shuffle compression codec.
…ting ParquetFile in SQLContext

There are 4 different compression codecs available for ```ParquetOutputFormat``` in Spark SQL; the value was hard-coded in ```ParquetRelation.defaultCompression```. Original discussion: #195 (diff)

I added a new config property in SQLConf to allow users to change this compression codec, and I used the same short-name syntax as described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0).

Btw, which codec should we use as the default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports the Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).

Author: chutium <[email protected]>

Closes #2039 from chutium/parquet-compression and squashes the following commits:

2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext
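The property the commit above adds can be set like any other Spark SQL configuration value; a minimal sketch (the accepted short names listed here are those supported by parquet-mr, per the PR discussion):

```properties
# spark-defaults.conf (or via SQLContext.setConf) — property added by SPARK-3131.
# Accepted values, per the PR: uncompressed, gzip, snappy, lzo.
spark.sql.parquet.compression.codec  snappy
```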
(cherry picked from commit 8856c3d) Signed-off-by: Michael Armbrust <[email protected]>