Conversation

@gengliangwang
Member

What changes were proposed in this pull request?

When reading from empty tables, the optimization OptimizeMetadataOnlyQuery may return wrong results:

```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```

The result is supposed to be `null`. However, with the optimization the result is `5`.

The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule was disabled by default in a later release (https://issues.apache.org/jira/browse/HIVE-15397) due to the same problem.

It is hard to completely avoid the correctness issue, because data sources like Parquet can be metadata-only: Spark can't tell whether a table is empty without actually reading it. This PR disables the optimization by default.
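Since the change flips the default, users who still want the metadata-only shortcut would have to opt back in explicitly. A minimal sketch of the trade-off; the flag name `spark.sql.optimizer.metadataOnly` is my assumption based on the related SQLConf entry, not quoted from this patch:

```
-- Assumed flag name; verify against SQLConf in your Spark release.
-- Default after this patch: false (optimization disabled).
SET spark.sql.optimizer.metadataOnly = true;
-- With the flag re-enabled, this query may wrongly return 5 instead of
-- NULL for the all-empty partition created above.
SELECT MAX(p1) FROM t;
```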

How was this patch tested?

Unit test

…cords correctly

When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results:
```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```
The result is supposed to be `null`. However, with the optimization the result is `5`.

The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in apache#13494. In Hive, the rule was disabled by default in a later release (https://issues.apache.org/jira/browse/HIVE-15397) due to the same problem.

It is hard to completely avoid the correctness issue, because data sources like Parquet can be metadata-only: Spark can't tell whether a table is empty without actually reading it. This PR disables the optimization by default.

Unit test

Closes apache#23635 from gengliangwang/optimizeMetadata.

Lead-authored-by: Gengliang Wang <[email protected]>
Co-authored-by: Xiao Li <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
@gengliangwang
Member Author

This PR ports #23635 to branch-2.3.

Member

@gatorsmile gatorsmile left a comment

LGTM pending Jenkins.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101667 has finished for PR 23648 at commit d782e4a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101670 has finished for PR 23648 at commit d782e4a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101677 has finished for PR 23648 at commit deb84ea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jan 25, 2019

It seems the test couldn't create an empty partition correctly... how about this?

```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT 1")
sql("TRUNCATE TABLE t PARTITION (p1 = 5)")
```

@gengliangwang
Member Author

@maropu Thanks.

```
sql("TRUNCATE TABLE t PARTITION (p1 = 5)")
```

will throw an exception:

```
org.apache.spark.sql.catalyst.analysis.NoSuchPartitionException: Partition not found in table 't' database 'default':
[info] p1 -> 5;
```
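A genuinely empty partition can also be registered directly in the metastore without writing any data files. This is a sketch, untested against this branch; whether it reproduces the bug here is an assumption, not something verified in this thread:

```
-- ALTER TABLE ... ADD PARTITION is standard Hive/Spark DDL and creates the
-- partition entry without inserting any rows.
ALTER TABLE t ADD PARTITION (p1 = 5);
-- With the metadata-only optimization enabled, MAX(p1) could then return 5
-- even though the partition holds no rows.
SELECT MAX(p1) FROM t;
```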

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101683 has finished for PR 23648 at commit 096bb6c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101682 has finished for PR 23648 at commit bf280cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101688 has finished for PR 23648 at commit 096bb6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jan 26, 2019

LGTM

@maropu maropu changed the title [BRANCH-2.3][SPARK-26709][SQL] OptimizeMetadataOnlyQuery does not handle empty records correctly [SPARK-26709][SQL][BRANCH-2.3] OptimizeMetadataOnlyQuery does not handle empty records correctly Jan 26, 2019
asfgit pushed a commit that referenced this pull request Jan 26, 2019
…dle empty records correctly

## What changes were proposed in this pull request?

When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results:
```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```
The result is supposed to be `null`. However, with the optimization the result is `5`.

The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule was disabled by default in a later release (https://issues.apache.org/jira/browse/HIVE-15397) due to the same problem.

It is hard to completely avoid the correctness issue, because data sources like Parquet can be metadata-only: Spark can't tell whether a table is empty without actually reading it. This PR disables the optimization by default.

## How was this patch tested?
Unit test

Closes #23648 from gengliangwang/SPARK-26709.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
@maropu
Member

maropu commented Jan 26, 2019

Thanks! Merged to branch-2.3.

@maropu maropu closed this Jan 26, 2019