[SPARK-25102][SQL] Write Spark version to ORC/Parquet file metadata #22932
Conversation
Please note that the following test case is executed twice: once in OrcSourceSuite and once in HiveOrcSourceSuite.
Test build #98420 has finished for PR 22932 at commit

Retest this please.

Test build #98421 has finished for PR 22932 at commit
Is this caused by adding `org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT`?
Right, @gatorsmile.
Hm, does it mean that the tests will basically fail or be fixed for official releases (since those don't have `-SNAPSHOT`)?
Nice catch! Hmm. I think we should not depend on the number of bytes in the test case.
Hmmm... yeah, I think we should avoid that.
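A byte-count-free check could assert on the written metadata instead; here is a minimal sketch, assuming the ORC reader API and a hypothetical output directory (this is not the PR's actual test code):

```scala
import java.io.File

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

// Hypothetical directory previously written by df.write.orc(...).
val dir = new File("/tmp/orc_version_test")
val orcFile = dir.listFiles().filter(_.getName.endsWith(".orc")).head

val reader = OrcFile.createReader(
  new Path(orcFile.getAbsolutePath), OrcFile.readerOptions(new Configuration()))

// Assert on the metadata key itself, so the test keeps passing when the
// version string (and therefore the file size) changes between releases.
assert(reader.hasMetadataValue("org.apache.spark.sql.create.version"))
```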
It's filed, and I made a PR for SPARK-25971 for SQLQueryTestSuite.
Test build #98429 has finished for PR 22932 at commit
The last commit will pass the test. The previous one fails due to

Test build #98430 has finished for PR 22932 at commit
Could you review this please, @gatorsmile?

Will take a look this week. Thanks for your work!

I see. Thanks, @gatorsmile.
Is this a pre-existing key? It seems that `org.apache.spark.version` should be enough.
Thank you for the review, @hvanhovell. Yes, we can use `org.apache.spark.version` since this is a new key.
Although the Hive table property `spark.sql.create.version` has the `.sql.create.` part, it seems we don't need to follow that convention here.
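For illustration, such a key would typically live in one shared constant so the ORC and Parquet writers stay in sync; a minimal sketch, where the object name is an assumption (the diff below refers to the constant as `SPARK_VERSION_METADATA_KEY`):

```scala
package org.apache.spark.sql

// Hedged sketch: one shared constant for the file-metadata key agreed on
// above. The enclosing object name here is hypothetical.
object VersionMetadata {
  val SPARK_VERSION_METADATA_KEY: String = "org.apache.spark.version"
}
```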
Test build #98456 has finished for PR 22932 at commit

Retest this please.

Test build #98539 has finished for PR 22932 at commit

Could you review this please, @gatorsmile?
felixcheung left a comment:
Would it be useful to add a prop for whether it's written using the native writer?
Thank you for the review, @felixcheung. Could you elaborate a little bit more? Here, three writers are used: the new native ORC writer, the old Hive ORC writer, and the native Parquet writer.

@felixcheung, if the question is about writer versions: Spark/Hive works on top of the ORC/Parquet libraries, and each library already writes its own writer version for that purpose. To me, that looks sufficient.
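For reference, those library-level versions can be inspected with the ORC reader API; a short sketch with a hypothetical file path:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

// Hypothetical path to an ORC part file written by Spark.
val reader = OrcFile.createReader(
  new Path("/tmp/o/part-00000.orc"), OrcFile.readerOptions(new Configuration()))

// Recorded by the ORC library itself, independently of any Spark metadata:
println(reader.getFileVersion)   // file format version, e.g. 0.12
println(reader.getWriterVersion) // which writer implementation produced the file
```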
Does it have different values for the new native ORC writer and the old Hive ORC writer?
Yes, it does. If you use
Test build #98625 has finished for PR 22932 at commit
```scala
val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
val writer = OrcFile.createWriter(filename, options)
val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
```
Could we create a separate function for adding this metadata?
Thank you for the review, @gatorsmile. Sure, I'll refactor out the following line:

```scala
writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
```
```scala
writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
} catch {
  case NonFatal(e) => log.warn(e.toString, e)
}
```
The same comment here.
BTW, as you expected, we cannot use a single function for this; the Writer classes are not the same.
For this case, I'll refactor out all the new code (lines 281~289).
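A minimal sketch of what the extracted helper could look like for the native ORC writer (the object and method names are assumptions, and the version value is hard-coded only to keep the sketch self-contained):

```scala
import java.nio.charset.StandardCharsets.UTF_8

import scala.util.control.NonFatal

import org.apache.orc.Writer
import org.slf4j.LoggerFactory

object OrcVersionStamp {
  private val log = LoggerFactory.getLogger(getClass)

  // Assumed values mirroring the names in the diff above.
  private val SPARK_VERSION_METADATA_KEY = "org.apache.spark.version"
  private val SPARK_VERSION_SHORT = "3.0.0"

  // Stamps the Spark version into the ORC writer's user metadata; a
  // non-fatal failure is logged rather than rethrown, so adding metadata
  // can never fail the write itself.
  def addSparkVersionMetadata(writer: Writer): Unit = {
    try {
      writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
    } catch {
      case NonFatal(e) => log.warn(e.toString, e)
    }
  }
}
```

As noted above, the Hive ORC writer uses a different `Writer` class, so it would need its own variant of this helper.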
Test build #98671 has finished for PR 22932 at commit
```scala
val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
val writer = OrcFile.createWriter(filename, options)
val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
```
This is basically copied from `getRecordWriter`.
Right. To avoid reflection, this was the only way.
LGTM. Thanks! Merged to master.

Thank you so much!

Double-checked. A late LGTM too.
## What changes were proposed in this pull request?
Currently, Spark writes the Spark version number into Hive table properties under the key `spark.sql.create.version`.
```
parameters:{
spark.sql.sources.schema.part.0={
"type":"struct",
"fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
},
transient_lastDdlTime=1541142761,
spark.sql.sources.schema.numParts=1,
spark.sql.create.version=2.4.0
}
```
This PR aims to write the Spark version to ORC/Parquet file metadata with `org.apache.spark.sql.create.version` because we already used the `org.apache.` prefix in Parquet metadata. It differs from the Hive table property key `spark.sql.create.version`, but it seems we cannot change the Hive table property for backward-compatibility reasons.
After this PR, ORC and Parquet files generated by Spark will have the following metadata.
**ORC (`native` and `hive` implementations)**
```
$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
org.apache.spark.sql.create.version=3.0.0
```
**PARQUET**
```
$ parquet-tools meta /tmp/p
...
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra: org.apache.spark.sql.create.version = 3.0.0
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```
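Beyond the CLI tools, the keys can also be read back programmatically; a minimal sketch using the parquet-hadoop footer API, with a hypothetical part-file path:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Hypothetical path to one Parquet part file under /tmp/p.
val inputFile = HadoopInputFile.fromPath(
  new Path("/tmp/p/part-00000.parquet"), new Configuration())
val reader = ParquetFileReader.open(inputFile)
try {
  // The key-value metadata holds the "extra" entries shown by parquet-tools.
  val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
  println(kv.get("org.apache.spark.sql.create.version"))
} finally {
  reader.close()
}
```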
## How was this patch tested?
Pass the Jenkins with newly added test cases.
This closes apache#22255.
Closes apache#22932 from dongjoon-hyun/SPARK-25102.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
### Backport (branch-2.4, #28142)

This is a backport of #22932; the description above applies unchanged.

### Why are the changes needed?

This backport helps us handle these files differently in Apache Spark 3.0.0.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with newly added test cases.

Closes #28142 from dongjoon-hyun/SPARK-25102-2.4.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>