Conversation

@rxin rxin commented Jul 15, 2014

This reduces shuffle compression memory usage by 3x.

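For anyone configuring this explicitly: the default being changed here is the `spark.io.compression.codec` setting. A minimal sketch of pinning it yourself (full class name; the short names like `snappy` were added separately in SPARK-2953):

```
spark.io.compression.codec  org.apache.spark.io.SnappyCompressionCodec
```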
SparkQA commented Jul 15, 2014

QA tests have started for PR 1415. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16665/consoleFull

mridulm commented Jul 15, 2014

Do we want to change the default for everything, or only for shuffle? (Changing only shuffle won't impact anything outside of Spark.)
What would be the impact on user data if we change it for all? (It is a DeveloperApi after all, so there might be user data consumed through this?)

rxin commented Jul 15, 2014

This is actually only used in shuffle.

rxin commented Jul 15, 2014

Actually I lied. Somebody else added some code to use the compression codec to compress event data ...

rxin commented Jul 15, 2014

It's actually bad to use these compression codecs to compress event data, because there is no guarantee they can be decompressed on different platforms ...

rxin commented Jul 15, 2014

cc @andrewor14 I guess you added the event code ...

@rxin rxin changed the title SPARK-2469: Use Snappy (instead of LZF) for default shuffle compression codec [SPARK-2469] Use Snappy (instead of LZF) for default shuffle compression codec Jul 15, 2014
rxin commented Jul 15, 2014

I looked into the event logger code and it appears that the codec change should be fine. It figures out the codec for old data automatically anyway.

@andrewor14
Contributor

Yes, we log the codec used in a separate file so we don't lock ourselves out of our old event logs. This change seems fine.

mridulm commented Jul 15, 2014

@andrewor14 do we also log the block size, etc. of the codec used?
If yes, then at least for event data we should be fine.

IIRC we use the codec to compress
a) RDD (which could be in tachyon - and so shared between spark deployments ?),
b) shuffle (private to spark),
c) broadcast (private to spark).
d) event logging (discussed above)
e) checkpoint (could be shared between runs ?)

Other than (a) and (e), sharing data via the others would be non-trivial and something we don't need to support, IMO.
I am not very sure about (a) and (e) - thoughts?

rxin commented Jul 15, 2014

We should create a JIRA so compression streams use their first few bytes to record the compression codec and any settings it needs (for lzf/snappy/lz4, there aren't any).

I think compressed blocks in Tachyon can currently be a problem, and I'm less sure about (e). @tdas can you comment?
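The self-describing stream idea from that JIRA can be sketched in a few lines. This is illustrative only, not Spark's API: `write_block`, `read_block`, and the codec table are made up for this example, and zlib stands in for snappy/lzf/lz4 (whose Python bindings are third-party).

```python
import struct
import zlib  # stand-in codec; Spark's snappy/lzf/lz4 need third-party bindings

# Map of short codec names to (compress, decompress) pairs.
CODECS = {"zlib": (zlib.compress, zlib.decompress)}

def write_block(codec_name, payload):
    """Prefix the compressed payload with a length-prefixed codec name."""
    compress, _ = CODECS[codec_name]
    name = codec_name.encode("ascii")
    return struct.pack(">B", len(name)) + name + compress(payload)

def read_block(data):
    """Recover the codec name from the header, then decompress the rest."""
    name_len = data[0]
    codec_name = data[1:1 + name_len].decode("ascii")
    _, decompress = CODECS[codec_name]
    return decompress(data[1 + name_len:])
```

With a header like this, the reader no longer needs out-of-band knowledge of which codec produced the stream.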

SparkQA commented Jul 15, 2014

QA results for PR 1415:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16665/consoleFull

mridulm commented Jul 15, 2014

Weird - those test failures are unrelated to this change.

mridulm commented Jul 15, 2014

Ah yes, the block size is only used at compression time, and is inferred from the stream during decompression.
Then logging only the class name should be sufficient.

rxin commented Jul 15, 2014

Yea, the test failure isn't related.

If there is no objection, I'm going to merge this tomorrow. I will file a JIRA ticket so we can prepend compression codec information to compressed data, and then perhaps pick the compression codec during decompression based on that.

mridulm commented Jul 15, 2014

Can't comment on Tachyon since we don't use it and have no experience with it, unfortunately.
I am fine with this change for the rest.

@pwendell
Contributor

@rxin IIRC at one point we changed this before and it caused a performance regression in our perf suite, so we reverted it. At the time I think we were running on smaller data sets though. Maybe in this case we are willing to take a hit?

rxin commented Jul 15, 2014

Yea - stability seems much more important than a small performance gain ....

vanzin commented Jul 15, 2014

Only the codec names are stored in the event logs; no other information is currently recorded. But this change isn't really breaking anything in that area. (And, by default, event logs are not compressed.)

rxin commented Jul 15, 2014

FYI filed JIRA: https://issues.apache.org/jira/browse/SPARK-2496 Compression streams should write its codec info to the stream

rxin commented Jul 16, 2014

Ok I'm merging this one. Thanks guys.

@asfgit asfgit closed this in 4576d80 Jul 16, 2014
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 18, 2014
…ion codec

This reduces shuffle compression memory usage by 3x.

Author: Reynold Xin <[email protected]>

Closes apache#1415 from rxin/snappy and squashes the following commits:

06c1a01 [Reynold Xin] SPARK-2469: Use Snappy (instead of LZF) for default shuffle compression codec.
@rxin rxin deleted the snappy branch August 13, 2014 08:02
asfgit pushed a commit that referenced this pull request Aug 26, 2014
…ting ParquetFile in SQLContext

There are 4 different compression codecs available for ```ParquetOutputFormat```.

In Spark SQL, it was set as a hard-coded value in ```ParquetRelation.defaultCompression```.

Original discussion:
#195 (diff)

I added a new config property in SQLConf to allow the user to change this compression codec, and I used the same short-name syntax as described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0).

BTW, which codec should we use as default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports the Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).

Author: chutium <[email protected]>

Closes #2039 from chutium/parquet-compression and squashes the following commits:

2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext
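For context, the knob that commit adds is a SQLConf property. A minimal sketch of overriding it from a SQL session (property name as introduced in SPARK-3131; treat the exact name as an assumption if your version differs):

```
SET spark.sql.parquet.compression.codec=snappy;
```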
asfgit pushed a commit that referenced this pull request Aug 26, 2014
kayousterhout pushed a commit to kayousterhout/spark-1 that referenced this pull request Aug 27, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014