[SPARK-25102][SQL] Write Spark version to ORC/Parquet file metadata #22932
Conversation
Please note that the following test case is executed twice: once in OrcSourceSuite and once in HiveOrcSourceSuite.
Test build #98420 has finished for PR 22932 at commit

Retest this please.

Test build #98421 has finished for PR 22932 at commit
Is this caused by adding `org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT`?
Right, @gatorsmile.
Hm, does it mean that the tests will basically fail or be fixed for official releases (since those don't have `-SNAPSHOT`)?
Nice catch! Hmm. I think we should not depend on the number of bytes in the test case.
Hmmm... yeah, I think we should avoid that.
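A byte-count-free check could assert on the written metadata instead; here is a minimal sketch, assuming the ORC reader API and a hypothetical output directory (this is not the PR's actual test code):

```scala
import java.io.File

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

// Hypothetical directory previously written by df.write.orc(...).
val dir = new File("/tmp/orc_version_test")
val orcFile = dir.listFiles().filter(_.getName.endsWith(".orc")).head

val reader = OrcFile.createReader(
  new Path(orcFile.getAbsolutePath), OrcFile.readerOptions(new Configuration()))

// Assert on the metadata key itself, so the test keeps passing when the
// version string (and therefore the file size) changes between releases.
assert(reader.hasMetadataValue("org.apache.spark.sql.create.version"))
```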
It's filed, and I made a PR for SPARK-25971 for SQLQueryTestSuite.
Test build #98429 has finished for PR 22932 at commit
The last commit will pass the test. The previous one fails due to

Test build #98430 has finished for PR 22932 at commit
Could you review this please, @gatorsmile?

Will take a look this week. Thanks for your work!

I see. Thanks, @gatorsmile.
Is this a pre-existing key? It seems that `org.apache.spark.version` should be enough.
Thank you for the review, @hvanhovell. Yes, we can use `org.apache.spark.version` since this is a new key.
Although the Hive table property `spark.sql.create.version` has the `.sql.create.` part, it seems we don't need to follow that convention here.
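For illustration, such a key would typically live in one shared constant so the ORC and Parquet writers stay in sync; a minimal sketch, where the object name is an assumption (the diff below refers to the constant as `SPARK_VERSION_METADATA_KEY`):

```scala
package org.apache.spark.sql

// Hedged sketch: one shared constant for the file-metadata key agreed on
// above. The enclosing object name here is hypothetical.
object VersionMetadata {
  val SPARK_VERSION_METADATA_KEY: String = "org.apache.spark.version"
}
```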
Test build #98456 has finished for PR 22932 at commit

Retest this please.

Test build #98539 has finished for PR 22932 at commit

Could you review this please, @gatorsmile?
felixcheung left a comment:
Would it be useful to add a prop for whether it's written using the native writer?
Thank you for the review, @felixcheung. Could you elaborate a little bit more? Here, three writers are used: the new native ORC writer, the old Hive ORC writer, and the native Parquet writer.

@felixcheung, if the question is about writer versions: Spark/Hive works on top of the ORC/Parquet libraries, and each library already writes its own writer version for that purpose. To me, that looks sufficient.
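For reference, those library-level versions can be inspected with the ORC reader API; a short sketch with a hypothetical file path:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

// Hypothetical path to an ORC part file written by Spark.
val reader = OrcFile.createReader(
  new Path("/tmp/o/part-00000.orc"), OrcFile.readerOptions(new Configuration()))

// Recorded by the ORC library itself, independently of any Spark metadata:
println(reader.getFileVersion)   // file format version, e.g. 0.12
println(reader.getWriterVersion) // which writer implementation produced the file
```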
Does it have different values for the new native ORC writer and the old Hive ORC writer?
Yes, it does. If you use
Test build #98625 has finished for PR 22932 at commit
```scala
val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
val writer = OrcFile.createWriter(filename, options)
val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
```
Could we create a separate function for adding this metadata?
Thank you for the review, @gatorsmile. Sure, I'll refactor out the following line:

```scala
writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
```
```scala
writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
} catch {
  case NonFatal(e) => log.warn(e.toString, e)
}
```
The same comment here.
BTW, as you expected, we cannot use a single function for this; the Writer classes are not the same.
For this case, I'll refactor out all the new code (lines 281~289).
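A minimal sketch of what the extracted helper could look like for the native ORC writer (the object and method names are assumptions, and the version value is hard-coded only to keep the sketch self-contained):

```scala
import java.nio.charset.StandardCharsets.UTF_8

import scala.util.control.NonFatal

import org.apache.orc.Writer
import org.slf4j.LoggerFactory

object OrcVersionStamp {
  private val log = LoggerFactory.getLogger(getClass)

  // Assumed values mirroring the names in the diff above.
  private val SPARK_VERSION_METADATA_KEY = "org.apache.spark.version"
  private val SPARK_VERSION_SHORT = "3.0.0"

  // Stamps the Spark version into the ORC writer's user metadata; a
  // non-fatal failure is logged rather than rethrown, so adding metadata
  // can never fail the write itself.
  def addSparkVersionMetadata(writer: Writer): Unit = {
    try {
      writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
    } catch {
      case NonFatal(e) => log.warn(e.toString, e)
    }
  }
}
```

As noted above, the Hive ORC writer uses a different `Writer` class, so it would need its own variant of this helper.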
Test build #98671 has finished for PR 22932 at commit
```scala
val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
val writer = OrcFile.createWriter(filename, options)
val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
```
This is basically copied from `getRecordWriter`.
Right. To avoid reflection, this was the only way.
LGTM. Thanks! Merged to master.

Thank you so much!

Double-checked. A late LGTM too.
## What changes were proposed in this pull request?
Currently, Spark writes the Spark version number into Hive table properties under the key `spark.sql.create.version`.
```
parameters:{
spark.sql.sources.schema.part.0={
"type":"struct",
"fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
},
transient_lastDdlTime=1541142761,
spark.sql.sources.schema.numParts=1,
spark.sql.create.version=2.4.0
}
```
This PR aims to write the Spark version to ORC/Parquet file metadata with `org.apache.spark.sql.create.version` because we already used the `org.apache.` prefix in Parquet metadata. It differs from the Hive table property key `spark.sql.create.version`, but it seems we cannot change the Hive table property for backward-compatibility reasons.
After this PR, ORC and Parquet files generated by Spark will have the following metadata.
**ORC (`native` and `hive` implementations)**
```
$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
org.apache.spark.sql.create.version=3.0.0
```
**PARQUET**
```
$ parquet-tools meta /tmp/p
...
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra: org.apache.spark.sql.create.version = 3.0.0
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```
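Beyond the CLI tools, the keys can also be read back programmatically; a minimal sketch using the parquet-hadoop footer API, with a hypothetical part-file path:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Hypothetical path to one Parquet part file under /tmp/p.
val inputFile = HadoopInputFile.fromPath(
  new Path("/tmp/p/part-00000.parquet"), new Configuration())
val reader = ParquetFileReader.open(inputFile)
try {
  // The key-value metadata holds the "extra" entries shown by parquet-tools.
  val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
  println(kv.get("org.apache.spark.sql.create.version"))
} finally {
  reader.close()
}
```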
## How was this patch tested?
Pass the Jenkins with newly added test cases.
This closes apache#22255.
Closes apache#22932 from dongjoon-hyun/SPARK-25102.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
### Backport (branch-2.4, #28142)

This is a backport of #22932; the description above applies unchanged.

### Why are the changes needed?

This backport helps us handle these files differently in Apache Spark 3.0.0.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with newly added test cases.

Closes #28142 from dongjoon-hyun/SPARK-25102-2.4.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>