
Conversation

@dongjoon-hyun
Member

What changes were proposed in this pull request?

In Apache Spark 2.4, SPARK-23355 fixes a bug where table properties were ignored during convertMetastore for tables created with STORED AS ORC/PARQUET.

For Parquet tables with table properties like TBLPROPERTIES (parquet.compression 'NONE'), the property was ignored by default before Apache Spark 2.4. After upgrading a cluster, Spark writes uncompressed files, which differs from Apache Spark 2.3 and earlier.

To provide full backward compatibility, this PR introduces an additional configuration, spark.sql.hive.convertMetastoreTableProperty, which restores the previous behavior by ignoring table properties.
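
For illustration, a minimal sketch of how the proposed flag would be used, assuming the configuration name above; the table name `t` and the session setup are hypothetical, and Hive support must be enabled:

    import org.apache.spark.sql.SparkSession

    // Hypothetical usage sketch for the configuration proposed in this PR.
    val spark = SparkSession.builder()
      .appName("convertMetastoreTableProperty-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Restore the pre-2.4 behavior: ignore Parquet table properties during conversion.
    spark.conf.set("spark.sql.hive.convertMetastoreTableProperty", "false")

    spark.sql("CREATE TABLE t(id INT) STORED AS PARQUET " +
      "TBLPROPERTIES ('parquet.compression'='NONE')")
    spark.sql("INSERT INTO t VALUES (1)")  // written as in Spark 2.3, ignoring the property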

How was this patch tested?

Pass the existing Jenkins tests.

  private def isParquetProperty(key: String) =
-   key.startsWith("parquet.") || key.contains(".parquet.")
+   conf.getConf(HiveUtils.CONVERT_METASTORE_TABLE_PROPERTY) &&
+     key.startsWith("parquet.") || key.contains(".parquet.")
Contributor

is this missing ( and )?

Member Author

Oh. Thanks!
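
For context: in Scala, `&&` binds more tightly than `||`, so without parentheses the configuration only guards the `startsWith` check. Below is a sketch of the presumably intended form, reusing the identifiers from the snippet above (the actual fix landed in commit 137e2dc and is not reproduced here):

    // Parentheses make the configuration guard both predicates.
    private def isParquetProperty(key: String) =
      conf.getConf(HiveUtils.CONVERT_METASTORE_TABLE_PROPERTY) &&
        (key.startsWith("parquet.") || key.contains(".parquet."))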

@SparkQA

SparkQA commented May 7, 2018

Test build #90331 has finished for PR 21259 at commit eaecabc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 8, 2018

Test build #90341 has finished for PR 21259 at commit 137e2dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Could you review this backward-compatibility PR, @cloud-fan, @gatorsmile, @mridulm, and @HyukjinKwon?

- Since Spark 2.4, creating a managed table with a nonempty location is not allowed. An exception is thrown when attempting to create a managed table with a nonempty location. Setting `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` to `true` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, regardless of the order of the input arguments. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these two functions can return unexpected results. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains a timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. Users who don't care about this problem and want to retain the previous behavior to keep their queries unchanged can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.
- In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. This also happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` when `spark.sql.hive.convertMetastoreOrc=true`. Since Spark 2.4, Spark supports Parquet/ORC-specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy Parquet files during insertion in Spark 2.3, whereas in Spark 2.4 the result would be uncompressed Parquet files. Setting `spark.sql.hive.convertMetastoreTableProperty` to `false` restores the previous behavior.
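
To make the `from_utc_timestamp` entry above concrete, here is a small sketch; the calls and expected outputs are taken from that entry, and an existing `spark` session with a GMT+8 local timezone is assumed for the second query:

    // Same string without a timezone suffix: identical result in Spark 2.3 and 2.4.
    spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')").show()
    // -> 2000-10-10 01:00:00

    // The string carries a timezone (+00:00); assuming a local timezone of GMT+8:
    spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')").show()
    // -> 2000-10-10 09:00:00 in Spark 2.3, null in Spark 2.4
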
Contributor

is it really possible that a user sets the table property and wants Spark to ignore it? I'm OK with mentioning it in the migration guide, but adding a conf seems overkill to me. cc @gatorsmile

Contributor

Setting a property and expecting Spark to ignore it does not sound logical (Spark not honoring a property is a bug IMO, which, thankfully, has been fixed in 2.4).
Having said that, I agree with you that mentioning this in the migration guide might be sufficient; we have behavior changes between versions all the time, and a conf is not necessary when the change is in the right direction.

Member

+1 on ^ and @cloud-fan's.

Member Author

Thank you for the reviews, @cloud-fan, @mridulm, @HyukjinKwon! I'll update it like that.

@dongjoon-hyun
Member Author

Thank you all. SPARK-24112 is closed, and the doc change will go in as a follow-up PR for SPARK-23355.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-convertMetastoreTableProperty branch January 7, 2019 07:06