4 changes: 4 additions & 0 deletions docs/sql-programming-guide.md
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
- Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
- Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.

## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above

- In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column name in the Hive metastore schema and the Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark performs case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so even if column names are in different letter cases, Spark returns the corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
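
For illustration, a minimal sketch of the documented behavior end to end; the path and table name are made up for this example and are not part of the PR:

```scala
// Physical Parquet column is upper-case "A"; the metastore schema declares lower-case "a".
spark.range(3).selectExpr("id AS A").write.mode("overwrite").parquet("/tmp/mixed_case_example")
spark.sql("CREATE TABLE mixed_case_example (a LONG) USING parquet LOCATION '/tmp/mixed_case_example'")

spark.conf.set("spark.sql.caseSensitive", false)
// 2.3.1 and earlier: returns nulls; since 2.3.2: returns the values of Parquet column "A".
spark.sql("SELECT a FROM mixed_case_example").show()
```
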
Member

This is a behavior change. I am not sure whether we should backport it to 2.3.2. How about sending a note to the dev mailing list?

BTW, this only affects data source tables. What about Hive serde tables? Are they consistent?

Could you add a test case? Create a table using syntax like `CREATE TABLE ... STORED AS PARQUET`. You also need to turn off `spark.sql.hive.convertMetastoreParquet` in the test case.
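
A rough sketch of the kind of test being asked for might look like the following; it assumes Spark's SQL test helpers (`withTempDir`, `withSQLConf`, `withTable`, `checkAnswer`) are available in the suite and is illustrative rather than the PR's actual test:

```scala
import org.apache.spark.sql.Row

test("read Hive serde Parquet table whose physical column case differs from the schema") {
  withTempDir { dir =>
    val path = dir.getCanonicalPath
    // Physical Parquet column is upper-case "A"; the table schema uses lower-case "a".
    spark.range(5).selectExpr("id AS A").write.mode("overwrite").parquet(path)
    withSQLConf("spark.sql.hive.convertMetastoreParquet" -> "false",
                "spark.sql.caseSensitive" -> "false") {
      withTable("parquet_serde_tbl") {
        sql(s"CREATE TABLE parquet_serde_tbl (a LONG) STORED AS PARQUET LOCATION '$path'")
        checkAnswer(sql("SELECT a FROM parquet_serde_tbl"), (0L until 5L).map(Row(_)))
      }
    }
  }
}
```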

Contributor Author (@seancxmao), Aug 23, 2018

Following your advice, I did a thorough comparison between data source tables and Hive serde tables.

Parquet data and tables are created via the following code:

```scala
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("parquet").mode("overwrite").save("/user/hive/warehouse/parquet_data")
```

```sql
CREATE TABLE parquet_data_source_lower (a LONG, b LONG, c LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_data_source_upper (A LONG, B LONG, C LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
```

`spark.sql.hive.convertMetastoreParquet` is set to false:

```scala
spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
```

Below are the comparison results both without #22148 and with #22148.

The comparison result without #22148:

| no. | caseSensitive | table columns | select column | parquet column (select via data source table) | parquet column (select via hive serde table) | consistent? | resolved by #22148 |
|-----|---------------|---------------|---------------|-----------------------------------------------|----------------------------------------------|-------------|--------------------|
| 1   | true          | a, b, c       | a             | a                                             |                                              |             |                    |
| 2   |               |               | b             | null                                          | B                                            | NG          |                    |
| 3   |               |               | c             |                                               |                                              |             |                    |
| 4   |               |               | A             | AnalysisException                             | AnalysisException                            |             |                    |
| 5   |               |               | B             | AnalysisException                             | AnalysisException                            |             |                    |
| 6   |               |               | C             | AnalysisException                             | AnalysisException                            |             |                    |
| 7   |               | A, B, C       | a             | AnalysisException                             | AnalysisException                            |             |                    |
| 8   |               |               | b             | AnalysisException                             | AnalysisException                            |             |                    |
| 9   |               |               | c             | AnalysisException                             | AnalysisException                            |             |                    |
| 10  |               |               | A             | null                                          |                                              | NG          |                    |
| 11  |               |               | B             | B                                             |                                              |             |                    |
| 12  |               |               | C             |                                               |                                              | NG          |                    |
| 13  | false         | a, b, c       | a             |                                               |                                              |             |                    |
| 14  |               |               | b             | null                                          |                                              | NG          | Y                  |
| 15  |               |               | c             |                                               |                                              |             |                    |
| 16  |               |               | A             |                                               |                                              |             |                    |
| 17  |               |               | B             | null                                          |                                              | NG          | Y                  |
| 18  |               |               | C             |                                               |                                              |             |                    |
| 19  |               | A, B, C       | a             | null                                          |                                              | NG          | Y                  |
| 20  |               |               | b             |                                               |                                              |             |                    |
| 21  |               |               | c             |                                               |                                              | NG          |                    |
| 22  |               |               | A             | null                                          |                                              | NG          | Y                  |
| 23  |               |               | B             |                                               |                                              |             |                    |
| 24  |               |               | C             |                                               |                                              | NG          |                    |

The comparison result with #22148 applied:

| no. | caseSensitive | table columns | select column | parquet column (select via data source table) | parquet column (select via hive serde table) | consistent? | introduced by #22148 |
|-----|---------------|---------------|---------------|-----------------------------------------------|----------------------------------------------|-------------|----------------------|
| 1   | true          | a, b, c       | a             |                                               |                                              |             |                      |
| 2   |               |               | b             | null                                          |                                              | NG          |                      |
| 3   |               |               | c             |                                               |                                              |             |                      |
| 4   |               |               | A             | AnalysisException                             | AnalysisException                            |             |                      |
| 5   |               |               | B             | AnalysisException                             | AnalysisException                            |             |                      |
| 6   |               |               | C             | AnalysisException                             | AnalysisException                            |             |                      |
| 7   |               | A, B, C       | a             | AnalysisException                             | AnalysisException                            |             |                      |
| 8   |               |               | b             | AnalysisException                             | AnalysisException                            |             |                      |
| 9   |               |               | c             | AnalysisException                             | AnalysisException                            |             |                      |
| 10  |               |               | A             | null                                          |                                              | NG          |                      |
| 11  |               |               | B             |                                               |                                              |             |                      |
| 12  |               |               | C             |                                               |                                              | NG          |                      |
| 13  | false         | a, b, c       | a             |                                               |                                              |             |                      |
| 14  |               |               | b             |                                               |                                              |             |                      |
| 15  |               |               | c             | RuntimeException                              |                                              | NG          | Y                    |
| 16  |               |               | A             |                                               |                                              |             |                      |
| 17  |               |               | B             |                                               |                                              |             |                      |
| 18  |               |               | C             | RuntimeException                              |                                              | NG          | Y                    |
| 19  |               | A, B, C       | a             |                                               |                                              |             |                      |
| 20  |               |               | b             |                                               |                                              |             |                      |
| 21  |               |               | c             | RuntimeException                              |                                              | NG          |                      |
| 22  |               |               | A             |                                               |                                              |             |                      |
| 23  |               |               | B             |                                               |                                              |             |                      |
| 24  |               |               | C             | RuntimeException                              |                                              | NG          |                      |

We can see that data source tables and Hive serde tables have two major differences with respect to Parquet field resolution.

Regarding Parquet field resolution, shall we make Hive serde table behavior consistent with data source table behavior? What do you think?

Member

We should respect `spark.sql.caseSensitive` in both modes, but also add a legacy SQLConf to enable users to revert back to the previous behavior.
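
Purely for illustration, such a legacy escape hatch would be toggled roughly like this; the config name below is hypothetical and not an actual Spark SQLConf:

```scala
// Hypothetical legacy flag (name invented for illustration): revert to the old,
// always-case-sensitive Parquet field resolution regardless of spark.sql.caseSensitive.
spark.conf.set("spark.sql.legacy.parquet.caseSensitiveFieldResolution", "true")
```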

Member

Could you add a test case for the one you did?

Contributor

First, we should not change the behavior of Hive tables. They inherit many behaviors from Hive, and we should keep them as they are.

Second, why do we treat this as a behavior change? I think it's a bug that we don't respect `spark.sql.caseSensitive` in field resolution. In general we should not add a config to restore a bug.

I don't think this documentation note is helpful. It explains a subtle and unreasonable behavior to users, which IMO just confuses them.

Member

Making 1 and 2 consistent is enough. :)

Member (@gatorsmile), Aug 28, 2018

BTW, the Parquet files could be generated by our `DataFrameWriter`. Thus, the physical schema and the logical schema could still have different cases.

Contributor (@yucai), Aug 29, 2018

@gatorsmile I think 1 and 2 are always consistent. They both use the Spark reader. Am I wrong?

  1. parquet table created by Spark (using parquet) read by Spark reader
  2. parquet table created by Spark (using hive) read by Spark reader

Member

#22184 (comment) already shows they are not consistent, right?

Contributor

The testing is based on `spark.sql.hive.convertMetastoreParquet` being set to false, so it should use the Hive serde reader instead of the Spark reader; sorry if that is confusing here.
I guess you mean 1 and 3 :). I understand now.

If we are not going to backport the PR to 2.3, can I also close SPARK-25206?
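
For context, the reader selection mentioned above is driven by `spark.sql.hive.convertMetastoreParquet`; a quick sketch of the two settings:

```scala
// true (default): Hive metastore Parquet tables are read with Spark's built-in Parquet reader.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")

// false: such tables go through the Hive SerDe reader instead (the setting used in the comparison above).
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
```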


## Upgrading From Spark SQL 2.3.0 to 2.3.1 and above

- As of version 2.3.1 Arrow functionality, including `pandas_udf` and `toPandas()`/`createDataFrame()` with `spark.sql.execution.arrow.enabled` set to `True`, has been marked as experimental. These are still evolving and not currently recommended for use in production.