[SPARK-2469] Use Snappy (instead of LZF) for default shuffle compression codec #1415
Conversation
…on codec. This reduces shuffle compression memory usage by 3x.
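For context, the codec being switched here is the one behind Spark's `spark.io.compression.codec` setting, so the old LZF behavior stays available via configuration. A minimal sketch (the short name `lzf` assumes the codec aliases discussed in SPARK-2953, which is referenced later in this thread; the fully-qualified class name also works):

```properties
# spark-defaults.conf — opt back into LZF if the Snappy default regresses a workload
spark.io.compression.codec  lzf
# equivalently, via the fully-qualified class name:
# spark.io.compression.codec  org.apache.spark.io.LZFCompressionCodec
```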
QA tests have started for PR 1415. This patch merges cleanly.
Do we want to change the default for everything, or only for shuffle? (Changing only shuffle won't impact anything outside of Spark.)
This is actually only used in shuffle.
Actually I lied. Somebody else added some code to use the compression codec to compress event data ...
It's actually bad to use these compression codecs to compress event data, because there is no guarantee they can be decompressed on different platforms ...
cc @andrewor14 I guess you added the event code ...
I looked into the event logger code, and it appears that the codec change should be fine. It figures out the codec for old data automatically anyway.
Yes, we log the codec used in a separate file so we don't lock ourselves out of our old event logs. This change seems fine.
@andrewor14 do we also log the block size, etc. of the codec used? IIRC we use the codec to compress … Other than (a) and (e), sharing data via the others would be non-trivial and something we don't need to support, IMO.
We should create a JIRA so compression streams use the first few bytes to track the compression codec and any settings it needs (for lzf/snappy/lz4, there aren't any). I think compressed blocks in Tachyon can currently be a problem, and (e) I'm less sure about. @tdas can you comment?
QA results for PR 1415:
Weird test failures; they are unrelated to this change.
Ah yes, block size is only used at compression time and is inferred from the stream during decompression.
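A stand-in illustration of this point (not Spark code; zlib from the Python standard library substitutes for snappy/lzf): settings chosen by the writer at compression time do not need to be communicated to the reader, because the compressed stream itself carries enough information to decompress.

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 256

# Writer-side choices (compression level here, block size in snappy's case)
# produce different byte streams ...
fast = zlib.compress(data, 1)   # fast, larger output
best = zlib.compress(data, 9)   # slower, smaller output

# ... but the reader needs none of those compression-time settings:
assert zlib.decompress(fast) == data
assert zlib.decompress(best) == data
```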
Yea, the test failure isn't related. If there is no objection, I'm going to merge this tomorrow. I will file a JIRA ticket so we can prepend compression codec information to compressed data, and then perhaps pick the compression codec during decompression based on that.
Can't comment on Tachyon since we don't use it and have no experience with it, unfortunately.
@rxin IIRC we changed this once before and it caused a performance regression in our perf suite, so we reverted it. At the time I think we were running on smaller data sets, though. Maybe in this case we are willing to take a hit?
Yea - stability seems much more important than a small performance gain ...
Only the codec names are stored in the event logs; no other information is currently recorded. But this change isn't really breaking anything in that area. (And, by default, event logs are not compressed.)
FYI, filed JIRA: https://issues.apache.org/jira/browse/SPARK-2496 "Compression streams should write its codec info to the stream"
Ok, I'm merging this one. Thanks guys.
…ion codec. This reduces shuffle compression memory usage by 3x. Author: Reynold Xin <[email protected]> Closes apache#1415 from rxin/snappy and squashes the following commits: 06c1a01 [Reynold Xin] SPARK-2469: Use Snappy (instead of LZF) for default shuffle compression codec.
…ting ParquetFile in SQLContext

There are 4 different compression codecs available for ```ParquetOutputFormat``` in Spark SQL; the value was hard-coded in ```ParquetRelation.defaultCompression```. Original discussion: #195 (diff)

I added a new config property in SQLConf to allow users to change this compression codec, and I used the same short-name syntax as described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0).

Btw, which codec should we use as the default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports the Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).

Author: chutium <[email protected]>

Closes #2039 from chutium/parquet-compression and squashes the following commits:

2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext
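The property the commit above adds can be set like any other Spark SQL configuration value; a minimal sketch (the accepted short names listed here are those supported by parquet-mr, per the PR discussion):

```properties
# spark-defaults.conf (or via SQLContext.setConf) — property added by SPARK-3131.
# Accepted values, per the PR: uncompressed, gzip, snappy, lzo.
spark.sql.parquet.compression.codec  snappy
```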
(cherry picked from commit 8856c3d) Signed-off-by: Michael Armbrust <[email protected]>