Conversation

@dongjoon-hyun (Member) commented Nov 3, 2018

What changes were proposed in this pull request?

Currently, Spark writes its version number into Hive table properties under the key spark.sql.create.version.

parameters:{
  spark.sql.sources.schema.part.0={
    "type":"struct",
    "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761, 
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}

This PR aims to write the Spark version to ORC/Parquet file metadata under org.apache.spark.sql.create.version, because we already use the org.apache. prefix in Parquet metadata. This differs from the Hive table property key spark.sql.create.version, but it seems we cannot change the Hive table property key for backward-compatibility reasons.

After this PR, ORC and Parquet files generated by Spark will have the following metadata.

ORC (native and hive implementations)

$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
  org.apache.spark.sql.create.version=3.0.0

PARQUET

$ parquet-tools meta /tmp/p
...
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       org.apache.spark.sql.create.version = 3.0.0
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
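
For a quick sanity check, the same user metadata can also be read back programmatically rather than via orc-tools. A minimal sketch using only the public org.apache.orc Reader API (the file path is illustrative; Spark's actual part-file names are longer):

```scala
import java.nio.charset.StandardCharsets.UTF_8

import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

// Open one of the ORC files written by Spark (illustrative path).
val reader = OrcFile.createReader(
  new Path("/tmp/o/part-00000.orc"), OrcFile.readerOptions(new Configuration()))

// Print every user metadata entry, including the Spark version key added by this PR.
reader.getMetadataKeys.asScala.foreach { key =>
  println(s"$key = ${UTF_8.decode(reader.getMetadataValue(key))}")
}
```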

How was this patch tested?

Pass the Jenkins with newly added test cases.

This closes #22255.

Member Author (dongjoon-hyun):

Please note that the following test case is executed twice: once in OrcSourceSuite and once in HiveOrcSourceSuite.
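
For context, the double execution falls out of the usual shared-suite pattern: a test defined in an abstract parent suite runs once in each concrete suite that extends it. A hedged, standalone ScalaTest illustration of the pattern (class names are placeholders, not the exact Spark test hierarchy):

```scala
import org.scalatest.funsuite.AnyFunSuite

// A test defined in the abstract parent runs once per concrete subclass,
// which is why the same check executes in both a native and a Hive suite.
abstract class SharedOrcChecks extends AnyFunSuite {
  protected def orcImpl: String

  test(s"Spark version metadata is written ($orcImpl implementation)") {
    // Real suites would write an ORC file and read its user metadata back.
    assert(orcImpl.nonEmpty)
  }
}

class NativeOrcChecks extends SharedOrcChecks { override def orcImpl: String = "native" }
class HiveOrcChecks extends SharedOrcChecks { override def orcImpl: String = "hive" }
```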

@SparkQA commented Nov 3, 2018

Test build #98420 has finished for PR 22932 at commit 601ccbb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Retest this please.

@SparkQA commented Nov 3, 2018

Test build #98421 has finished for PR 22932 at commit 601ccbb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member (gatorsmile):

This is caused by adding org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT?

Member Author (dongjoon-hyun):

Right, @gatorsmile .

Member:

Hm, does that mean the tests will fail (or need to be fixed) for official releases, since those don't have the -SNAPSHOT suffix?

@dongjoon-hyun (Member Author) commented Nov 8, 2018

Nice catch! Hmm. I think we should not depend on the number of bytes in the test case.

Member:

Hmm... yeah, I think we should avoid that.

Member Author (dongjoon-hyun):

It's filed, and I made a PR for SPARK-25971 to fix SQLQueryTestSuite.

@SparkQA commented Nov 3, 2018

Test build #98429 has finished for PR 22932 at commit 1ed6368.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

The last commit will pass the tests. The previous one failed due to trailing spaces.

@SparkQA commented Nov 3, 2018

Test build #98430 has finished for PR 22932 at commit f5d35b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Could you review this please, @gatorsmile?

@gatorsmile (Member)

Will take a look this week. Thanks for your work!

@dongjoon-hyun (Member Author)

I see. Thanks, @gatorsmile.

Contributor (hvanhovell):

Is this a pre-existing key? Seems that org.apache.spark.version should be enough.

@dongjoon-hyun (Member Author) commented Nov 4, 2018

Thank you for the review, @hvanhovell. Yes, we can use org.apache.spark.version since this is a new key.

Although the Hive table property spark.sql.create.version has the .sql.create. part, it seems we don't need to follow that convention here.
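
For illustration, a minimal sketch of how such a shared constant might be declared so both the ORC and Parquet write paths can reference it; the object and file placement are assumptions, not necessarily the merged code:

```scala
package org.apache.spark

// Hypothetical placement: a package-level constant visible to both writers.
package object sql {
  // Metadata key recorded in ORC user metadata and Parquet file metadata for
  // every file created by Spark, following the suggestion above.
  val SPARK_VERSION_METADATA_KEY: String = "org.apache.spark.version"
}
```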

@SparkQA commented Nov 5, 2018

Test build #98456 has finished for PR 22932 at commit ef49a27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Retest this please.

@SparkQA commented Nov 7, 2018

Test build #98539 has finished for PR 22932 at commit ef49a27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author) commented Nov 7, 2018

Could you review this please, @gatorsmile?

@felixcheung (Member) left a comment

Would it be useful to add a prop for whether it's written using the native writer?

@dongjoon-hyun (Member Author) commented Nov 7, 2018

Thank you for the review, @felixcheung. Could you elaborate a little bit more? Three writers are used here: the new native ORC writer, the old Hive ORC writer, and the native Parquet writer.

a prop for whether it's written using the native writer?

@dongjoon-hyun (Member Author)

@felixcheung, if the question is about writer versions: Spark/Hive works on top of the ORC/Parquet libraries, and each library already writes its own version for that purpose. To me, that looks sufficient.

ORC

$ orc-tools meta /tmp/o
File Version: 0.12 with ORC_135

Parquet

$ parquet-tools meta /tmp/o
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)

@felixcheung (Member)

felixcheung commented Nov 8, 2018 via email

@dongjoon-hyun (Member Author)

Yes, it does. If you use spark.sql.orc.impl=hive, it has a different version number, like the following.

File Version: 0.12 with HIVE_8732
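
For reference, a quick way to compare the two implementations is to toggle the config before writing and then inspect each output with orc-tools; this assumes an active SparkSession named spark, and the paths are placeholders:

```scala
// Native ORC writer (the default since Spark 2.4): orc-tools reports ORC_135.
spark.conf.set("spark.sql.orc.impl", "native")
spark.range(10).write.orc("/tmp/o_native")

// Hive ORC writer: orc-tools reports HIVE_8732.
spark.conf.set("spark.sql.orc.impl", "hive")
spark.range(10).write.orc("/tmp/o_hive")
```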

@SparkQA commented Nov 9, 2018

Test build #98625 has finished for PR 22932 at commit 396540a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
val writer = OrcFile.createWriter(filename, options)
val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
Member (gatorsmile):

Could we create a separate function for adding this metadata?

Member Author (dongjoon-hyun):

Thank you for the review, @gatorsmile. Sure, I'll refactor out the following line.

writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))

writer.addUserMetadata(SPARK_VERSION_METADATA_KEY, UTF_8.encode(SPARK_VERSION_SHORT))
} catch {
case NonFatal(e) => log.warn(e.toString, e)
}
Member:

The same comment here.

Member Author (dongjoon-hyun):

BTW, as you expected, we cannot use a single function for this; the writers are not the same type.

Member Author (dongjoon-hyun):

For this case, I'll refactor out all the new code (line 281 ~ 289).
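
As an illustration of the refactoring discussed in this thread, the ORC side could get a small helper of its own; the name, signature, and placement below are assumptions, and the Parquet path would need a separate variant because its writer type is different:

```scala
import java.nio.charset.StandardCharsets.UTF_8

import org.apache.orc.Writer

object OrcVersionMetadata {
  // Hypothetical helper: record the Spark version in an ORC file's user metadata.
  // `key` and `version` stand in for the SPARK_VERSION_METADATA_KEY and
  // SPARK_VERSION_SHORT constants used in the snippets above.
  def addSparkVersionMetadata(writer: Writer, key: String, version: String): Unit = {
    writer.addUserMetadata(key, UTF_8.encode(version))
  }
}
```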

@SparkQA commented Nov 10, 2018

Test build #98671 has finished for PR 22932 at commit 04457be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val filename = orcOutputFormat.getDefaultWorkFile(context, ".orc")
val options = OrcMapRedOutputFormat.buildOptions(context.getConfiguration)
val writer = OrcFile.createWriter(filename, options)
val recordWriter = new OrcMapreduceRecordWriter[OrcStruct](writer)
Member:

This is basically copied from getRecordWriter

Member Author (dongjoon-hyun):

Right. To avoid reflection, this was the only way.

@gatorsmile (Member)

LGTM. Thanks! Merged to master.

@dongjoon-hyun (Member Author)

Thank you so much!

@asfgit closed this in d66a4e8 on Nov 10, 2018
@dongjoon-hyun deleted the SPARK-25102 branch on November 10, 2018 07:05
@HyukjinKwon (Member)

Double-checked. A late LGTM too.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
(The commit message repeats the PR description above.)

Closes apache#22932 from dongjoon-hyun/SPARK-25102.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Apr 7, 2020
### What changes were proposed in this pull request?

This is a backport of #22932.

(The rest of this section repeats the PR description above.)

### Why are the changes needed?

This backport helps us handle these files differently in Apache Spark 3.0.0.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with newly added test cases.

Closes #28142 from dongjoon-hyun/SPARK-25102-2.4.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>