Skip to content

Conversation

@chutium
Copy link
Contributor

@chutium chutium commented Aug 19, 2014

There are 4 different compression codec available for ParquetOutputFormat

in Spark SQL, it was set as a hard-coded value in ParquetRelation.defaultCompression

original discuss:
#195 (diff)

i added a new config property in SQLConf to allow user to change this compression codec, and i used similar short names syntax as described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0)

btw, which codec should we use as default? it was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but i think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).

@SparkQA
Copy link

SparkQA commented Aug 19, 2014

QA tests have started for PR 2039 at commit 21235dc.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 19, 2014

QA tests have finished for PR 2039 at commit 21235dc.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chutium
Copy link
Contributor Author

chutium commented Aug 19, 2014

it seems failed test cases are not related to this PR, all of them are in sql/hive-thriftserver modul

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps spark.sql.parquet.compression.codec to be closer to the related setting in Spark core?

@marmbrus
Copy link
Contributor

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have started for PR 2039 at commit 21235dc.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

Tests timed out after a configured wait of 120m.

@chutium
Copy link
Contributor Author

chutium commented Aug 26, 2014

thanks @marmbrus , i change the property name and default codec

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have started for PR 2039 at commit 2f44964.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have finished for PR 2039 at commit 2f44964.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

Thanks! I've merged this into master and 1.1.

@asfgit asfgit closed this in 8856c3d Aug 26, 2014
asfgit pushed a commit that referenced this pull request Aug 26, 2014
…ting ParquetFile in SQLContext

There are 4 different compression codec available for ```ParquetOutputFormat```

in Spark SQL, it was set as a hard-coded value in ```ParquetRelation.defaultCompression```

original discuss:
#195 (diff)

i added a new config property in SQLConf to allow user to change this compression codec, and i used similar short names syntax as described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0)

btw, which codec should we use as default? it was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but i think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).

Author: chutium <[email protected]>

Closes #2039 from chutium/parquet-compression and squashes the following commits:

2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext

(cherry picked from commit 8856c3d)
Signed-off-by: Michael Armbrust <[email protected]>
kayousterhout pushed a commit to kayousterhout/spark-1 that referenced this pull request Aug 27, 2014
…ting ParquetFile in SQLContext

There are 4 different compression codec available for ```ParquetOutputFormat```

in Spark SQL, it was set as a hard-coded value in ```ParquetRelation.defaultCompression```

original discuss:
apache#195 (diff)

i added a new config property in SQLConf to allow user to change this compression codec, and i used similar short names syntax as described in SPARK-2953 apache#1873 (https://github.com/apache/spark/pull/1873/files#diff-0)

btw, which codec should we use as default? it was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but i think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, apache#1415), and parquet-mr supports Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).

Author: chutium <[email protected]>

Closes apache#2039 from chutium/parquet-compression and squashes the following commits:

2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…ting ParquetFile in SQLContext

There are 4 different compression codec available for ```ParquetOutputFormat```

in Spark SQL, it was set as a hard-coded value in ```ParquetRelation.defaultCompression```

original discuss:
apache#195 (diff)

i added a new config property in SQLConf to allow user to change this compression codec, and i used similar short names syntax as described in SPARK-2953 apache#1873 (https://github.com/apache/spark/pull/1873/files#diff-0)

btw, which codec should we use as default? it was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but i think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, apache#1415), and parquet-mr supports Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).

Author: chutium <[email protected]>

Closes apache#2039 from chutium/parquet-compression and squashes the following commits:

2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants