Conversation

@gengliangwang
Member

What changes were proposed in this pull request?

When reading from empty tables, the optimization OptimizeMetadataOnlyQuery may return wrong results:

```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```

The result is supposed to be `null`. However, with the optimization the result is `5`.

The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule was disabled by default in a later release (https://issues.apache.org/jira/browse/HIVE-15397) due to the same problem.

It is hard to completely avoid the correctness issue, because data sources like Parquet can be metadata-only: Spark can't tell whether a table is empty without actually reading it. This PR disables the optimization by default.
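Since the change flips the default, users who still want the metadata-only shortcut would have to opt back in explicitly. A minimal sketch of the trade-off; the flag name `spark.sql.optimizer.metadataOnly` is my assumption based on the related SQLConf entry, not quoted from this patch:

```
-- Assumed flag name; verify against SQLConf in your Spark release.
-- Default after this patch: false (optimization disabled).
SET spark.sql.optimizer.metadataOnly = true;
-- With the flag re-enabled, this query may wrongly return 5 instead of
-- NULL for the all-empty partition created above.
SELECT MAX(p1) FROM t;
```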

How was this patch tested?

Unit test

…cords correctly

When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results:
```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```
The result is supposed to be `null`. However, with the optimization the result is `5`.

The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in apache#13494. In Hive, the rule was disabled by default in a later release (https://issues.apache.org/jira/browse/HIVE-15397) due to the same problem.

It is hard to completely avoid the correctness issue, because data sources like Parquet can be metadata-only: Spark can't tell whether a table is empty without actually reading it. This PR disables the optimization by default.

Unit test

Closes apache#23635 from gengliangwang/optimizeMetadata.

Lead-authored-by: Gengliang Wang <[email protected]>
Co-authored-by: Xiao Li <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
@gengliangwang
Member Author

This PR ports #23635 to branch-2.3.

Member

@gatorsmile gatorsmile left a comment

LGTM pending Jenkins.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101667 has finished for PR 23648 at commit d782e4a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101670 has finished for PR 23648 at commit d782e4a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101677 has finished for PR 23648 at commit deb84ea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jan 25, 2019

It seems the test couldn't create an empty partition correctly... how about this?

```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT 1")
sql("TRUNCATE TABLE t PARTITION (p1 = 5)")
```

@gengliangwang
Member Author

@maropu Thanks.

```
sql("TRUNCATE TABLE t PARTITION (p1 = 5)")
```

will throw an exception:

```
org.apache.spark.sql.catalyst.analysis.NoSuchPartitionException: Partition not found in table 't' database 'default':
[info] p1 -> 5;
```
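A genuinely empty partition can also be registered directly in the metastore without writing any data files. This is a sketch, untested against this branch; whether it reproduces the bug here is an assumption, not something verified in this thread:

```
-- ALTER TABLE ... ADD PARTITION is standard Hive/Spark DDL and creates the
-- partition entry without inserting any rows.
ALTER TABLE t ADD PARTITION (p1 = 5);
-- With the metadata-only optimization enabled, MAX(p1) could then return 5
-- even though the partition holds no rows.
SELECT MAX(p1) FROM t;
```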

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101683 has finished for PR 23648 at commit 096bb6c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101682 has finished for PR 23648 at commit bf280cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101688 has finished for PR 23648 at commit 096bb6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jan 26, 2019

LGTM

@maropu maropu changed the title [BRANCH-2.3][SPARK-26709][SQL] OptimizeMetadataOnlyQuery does not handle empty records correctly [SPARK-26709][SQL][BRANCH-2.3] OptimizeMetadataOnlyQuery does not handle empty records correctly Jan 26, 2019
asfgit pushed a commit that referenced this pull request Jan 26, 2019
…dle empty records correctly

## What changes were proposed in this pull request?

When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results:
```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```
The result is supposed to be `null`. However, with the optimization the result is `5`.

The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule was disabled by default in a later release (https://issues.apache.org/jira/browse/HIVE-15397) due to the same problem.

It is hard to completely avoid the correctness issue, because data sources like Parquet can be metadata-only: Spark can't tell whether a table is empty without actually reading it. This PR disables the optimization by default.

## How was this patch tested?
Unit test

Closes #23648 from gengliangwang/SPARK-26709.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
@maropu
Member

maropu commented Jan 26, 2019

Thanks! Merged to branch-2.3.

@maropu maropu closed this Jan 26, 2019