[SPARK-26709][SQL][BRANCH-2.3] OptimizeMetadataOnlyQuery does not handle empty records correctly #23648
When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results:
```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```
The result is supposed to be `null`. However, with the optimization the result is `5`.
The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in apache#13494. Hive disabled the rule by default in a later release (https://issues.apache.org/jira/browse/HIVE-15397) because of the same problem.
It is hard to avoid the correctness issue completely, because a partition can exist in the metadata of a data source such as Parquet while holding no rows, and Spark can't tell whether it is empty without actually reading it. This PR disables the optimization by default.
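To make the failure mode concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spark code: `Partition`, `maxFromMetadata`, and `maxFromRows` are hypothetical names introduced only for illustration. It assumes a table whose catalog lists partition `p1 = 5` although the insert wrote zero rows into it, as in the example above.

```scala
// Toy model of the bug, not Spark internals. All names are hypothetical.
// A partition is a partition-column value plus the rows stored under it.
final case class Partition(p1: Int, rows: Seq[Int])

object MetadataOnlyDemo {
  // Metadata-only shortcut: answer MAX(p1) from the partition list alone,
  // without checking whether any partition actually contains rows.
  def maxFromMetadata(parts: Seq[Partition]): Option[Int] =
    parts.map(_.p1).reduceOption(_ max _)

  // Reference semantics: MAX(p1) over the rows that really exist.
  // An empty input yields None, which corresponds to SQL NULL.
  def maxFromRows(parts: Seq[Partition]): Option[Int] =
    parts.flatMap(p => p.rows.map(_ => p.p1)).reduceOption(_ max _)

  def main(args: Array[String]): Unit = {
    // Partition p1 = 5 is registered, but it holds zero rows.
    val table = Seq(Partition(p1 = 5, rows = Seq.empty))
    println(s"metadata-only: ${maxFromMetadata(table)}") // Some(5)
    println(s"row scan:      ${maxFromRows(table)}")     // None
  }
}
```

The two functions agree whenever every listed partition has at least one row, and they diverge only for empty partitions. In Spark itself the rule is controlled by the `spark.sql.optimizer.metadataOnly` setting, so a user who knows their partitions are never empty can presumably still opt back in.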
Unit test
Closes apache#23635 from gengliangwang/optimizeMetadata.
Lead-authored-by: Gengliang Wang <[email protected]>
Co-authored-by: Xiao Li <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
This PR ports #23635 to branch-2.3.
gatorsmile
left a comment
LGTM pending Jenkins.
Test build #101667 has finished for PR 23648 at commit
retest this please.
Test build #101670 has finished for PR 23648 at commit
Test build #101677 has finished for PR 23648 at commit
It seems the test couldn't create an empty partition correctly. How about this?
@maropu Thanks. Will throw exception
Test build #101683 has finished for PR 23648 at commit
Test build #101682 has finished for PR 23648 at commit
retest this please.
Test build #101688 has finished for PR 23648 at commit
LGTM
Closes #23648 from gengliangwang/SPARK-26709.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
Thanks! Merged to branch-2.3.