[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', parquet.compression needs to be considered.
#20076
…'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?

1. Also acquire 'compressionCodecClassName' from `parquet.compression`. The precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.
2. Change `spark.sql.parquet.compression.codec` to support "none". `ParquetOptions` already supports "none" as equivalent to "uncompressed", but the conf was not allowed to be set to "none".
3. Change `compressionCode` to `compressionCodecClassName`.

## How was this patch tested?

Manual test.
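The precedence order in item 1 can be sketched roughly as below. This is a minimal, Spark-free illustration, not the code in this PR: `SessionConf`, `CodecPrecedence` and `resolve` are hypothetical names, and the case-insensitive key lookup is an assumption of the sketch.

```scala
import java.util.Locale

// Hypothetical stand-in for the session-level conf
// (spark.sql.parquet.compression.codec); not Spark's SQLConf.
final case class SessionConf(parquetCompressionCodec: String)

object CodecPrecedence {
  // Pick the codec name: `compression` first, then `parquet.compression`,
  // and finally the session-level value.
  def resolve(options: Map[String, String], conf: SessionConf): String = {
    // Assumption of this sketch: option keys are matched case-insensitively.
    val normalized = options.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }
    normalized.get("compression")
      .orElse(normalized.get("parquet.compression"))
      .getOrElse(conf.parquetCompressionCodec)
  }
}
```

With a session value of "snappy", a table that sets only `parquet.compression` to "gzip" would resolve to "gzip" under this sketch, and adding `compression` would override both.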
   * Compression codec to use. By default use the value specified in SQLConf.
   * Acceptable values are defined in [[shortParquetCompressionCodecNames]].
   */
  val compressionCodecClassName: String = {
Can we change compressionCodecClassName to compressionCodec instead?
@HyukjinKwon Seems you're right.
@gatorsmile Are we mistaken? Shouldn't we change ParquetOptions's compressionCodec to compressionCodecClassName? OrcOptions and TextOptions are both using compressionCodec.
compressionCodecClassName is a better name. We should change all the others to this.
We could alternatively say compressionCodecName here. In this case the values are names like UNCOMPRESSED, LZO, etc. For the text-based sources, they are canonical class names, so I am okay with compressionCodecClassName, but for ORC and Parquet these are not classes.
compressionCodecName is also fine with me.
So, change all compressionCodecClassName and compressionCodec to compressionCodecName? In TextOptions, JSONOptions and CSVOptions too?
@gatorsmile @HyukjinKwon
In TextOptions, JSONOptions and CSVOptions, it's an "Option[String]", but in OrcOptions and ParquetOptions, it's a "String".
Is it OK to just change compressionCodecClassName in OrcOptions and ParquetOptions to compressionCodecName?
Let's do Parquet and ORC ones here for now if that's also fine to @gatorsmile.
cc @gatorsmile
ok to test
Test build #85372 has finished for PR 20076 at commit
update compressionCodecClassName to compressionCodecName
Test build #85377 has finished for PR 20076 at commit
Test build #85379 has finished for PR 20076 at commit
Test build #85378 has finished for PR 20076 at commit
Use ParquetOptions in test
Test build #85380 has finished for PR 20076 at commit
Fix test error
Test build #85381 has finished for PR 20076 at commit
Retest this please
Thanks for the PR. Why are we complicating the PR by doing the rename? Does this actually gain anything other than minor cosmetic changes? It makes the simple PR pretty long ...
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SQLTestUtils

class CompressionCodecSuite extends TestHiveSingleton with SQLTestUtils {
This suite does not need TestHiveSingleton.
 * limitations under the License.
 */

package org.apache.spark.sql.hive
Move it to sql/core.
Sure, let's revert back the rename then.
Test build #85388 has finished for PR 20076 at commit
Well, I'll revert back the renaming. Any comments? @gatorsmile
2 Move the test case to sql/core
Rename the test file name and class name
Test build #85394 has finished for PR 20076 at commit
Also add an end-to-end test case? For example, the one used in https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties ?
Do you mean what we do in the test case of another PR, #19218? @gatorsmile
Test build #85400 has finished for PR 20076 at commit
@gatorsmile
Try this?
CREATE TABLE A USING Parquet
OPTIONS('parquet.compression' = 'gzip')
AS SELECT 1 AS col1
 * limitations under the License.
 */

package org.apache.spark.sql
Should we move this to org.apache.spark.sql.execution.datasources.parquet? It seems this should not be at this package level.
Thank you. I have moved it to org.apache.spark.sql.execution.datasources.parquet.
docs/sql-programming-guide.md
<td>
  Sets the compression codec use when writing Parquet files. Acceptable values include:
  uncompressed, snappy, gzip, lzo.
  Sets the compression codec use when writing Parquet files. If other compression codec
s/use when/used when
val PARQUET_COMPRESSION = buildConf("spark.sql.parquet.compression.codec")
  .doc("Sets the compression codec use when writing Parquet files. Acceptable values include: " +
    "uncompressed, snappy, gzip, lzo.")
  .doc("Sets the compression codec use when writing Parquet files. If other compression codec " +
s/use when/used when
val ORC_COMPRESSION = buildConf("spark.sql.orc.compression.codec")
  .doc("Sets the compression codec use when writing ORC files. Acceptable values include: " +
  .doc("Sets the compression codec use when writing ORC files. If other compression codec " +
s/use when/used when
Thank you. I have fixed them.
@gatorsmile
Test build #85594 has finished for PR 20076 at commit
Test build #85595 has finished for PR 20076 at commit
/**
 * Options for the Parquet data source.
 */
private[parquet] class ParquetOptions(
Can we revive private[parquet]?
Yes, it should be revived. Thanks.
Test build #85716 has finished for PR 20076 at commit
  |'parquet.compression'='$compressionCodec')""".stripMargin
val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
sql(s"""CREATE TABLE $tableName USING Parquet $options $partitionCreate
  |as select 1 as col1, 2 as p""".stripMargin)
val options =
  s"""
    |OPTIONS('path'='${rootDir.toURI.toString.stripSuffix("/")}/$tableName',
    |'parquet.compression'='$compressionCodec')
  """.stripMargin
val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
sql(
  s"""
    |CREATE TABLE $tableName USING Parquet $options $partitionCreate
    |AS SELECT 1 AS col1, 2 AS p
  """.stripMargin)
  .doc("Sets the compression codec use when writing Parquet files. Acceptable values include: " +
    "uncompressed, snappy, gzip, lzo.")
  .doc("Sets the compression codec used when writing Parquet files. If other compression codec " +
    "configuration was found through hive or parquet, the precedence would be `compression`, " +
Sets the compression codec used when writing Parquet files. If either `compression` or `parquet.compression` is specified in the table-specific options/properties, the precedence would be `compression`, ...
Fix scala style
Change the description of spark.sql.parquet.compression
Change description
Test build #85741 has finished for PR 20076 at commit
Test build #85739 has finished for PR 20076 at commit
Test build #85740 has finished for PR 20076 at commit
LGTM Thanks! Merged to master/2.3
…quetOptions', `parquet.compression` needs to be considered.

[SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?

Since Hive 1.1, Hive allows users to set the Parquet compression codec via the table-level property parquet.compression. See the JIRA: https://issues.apache.org/jira/browse/HIVE-7858 . We already support orc.compression for ORC. Thus, for external users, it is more straightforward to support both. See the Stack Overflow question: https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties

On the Spark side, our table-level compression conf compression was added by #11464 in Spark 2.0.

We need to support both table-level confs. Users might also use the session-level conf spark.sql.parquet.compression.codec. The priority rule will be:

If other compression codec configuration was found through hive or parquet, the precedence would be compression, parquet.compression, spark.sql.parquet.compression.codec. Acceptable values include: none, uncompressed, snappy, gzip, lzo.

After the change, the rule for Parquet is consistent with the one for ORC.

Changes:

1. Also acquire 'compressionCodecClassName' from `parquet.compression`. The precedence order is `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.
2. Change `spark.sql.parquet.compression.codec` to support "none". `ParquetOptions` already supports "none" as equivalent to "uncompressed", but the conf was not allowed to be set to "none".
3. Change `compressionCode` to `compressionCodecClassName`.

## How was this patch tested?

Add test.

Author: fjh100456 <[email protected]>

Closes #20076 from fjh100456/ParquetOptionIssue.

(cherry picked from commit 7b78041)
Signed-off-by: gatorsmile <[email protected]>
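The "none" handling in change 2 can be illustrated with a small, Spark-free sketch. `CodecNames` and `normalize` are hypothetical names, and the error message wording is an assumption. Only the list of acceptable values and the "none" / "uncompressed" equivalence come from the description above.

```scala
import java.util.Locale

object CodecNames {
  // Acceptable short names from the description, mapped to upper-case codec
  // names. "none" is treated as an alias of "uncompressed".
  private val shortNames: Map[String, String] = Map(
    "none" -> "UNCOMPRESSED",
    "uncompressed" -> "UNCOMPRESSED",
    "snappy" -> "SNAPPY",
    "gzip" -> "GZIP",
    "lzo" -> "LZO")

  // Normalize a user-supplied name, rejecting anything outside the list.
  def normalize(name: String): String = {
    val key = name.toLowerCase(Locale.ROOT)
    shortNames.getOrElse(key, {
      val available = shortNames.keys.toSeq.sorted.mkString(", ")
      throw new IllegalArgumentException(
        s"Codec [$key] is not available. Available codecs are $available.")
    })
  }
}
```

Keeping the alias in the same lookup table as the real names means the session-level conf and the table-level options could share one validation path in a design like this.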