[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation with only null value column #24286
Conversation
@wzhfy @maropu @cloud-fan @dongjoon-hyun Can you please have a look?
To paraphrase for someone who doesn't know this code well: normally there's no way to have >0 rows but 0 distinct rows... except when the column is all null? distinct would return 0 rows? then a groupBy on that column still returns 1 grouping for that column?
Seems like so. Can we check the null count to make sure it's an all-null column?
@srowen Thanks for your comment.
You mean "all null values" should be defined not only by the distinct count being equal to 0, but also by the null count being greater than 0?
@cloud-fan The null value count check has been added as well; please recheck when available. Thanks.
    childStats.attributeStats(expr.asInstanceOf[Attribute]).distinctCount.get)
    (res, expr) => {
      val columnStat = childStats.attributeStats(expr.asInstanceOf[Attribute])
      val distinctValue: BigInt = if (columnStat.distinctCount.get == 0 &&
I'm slightly concerned about performance here. Can you save the value of distinctCount.get so it isn't accessed twice, or is it not actually recomputed twice as-is?
Done.
`childStats.distinctCount.get` just reads the value from the class variable/Option, so the cost seems negligible, but it's still better to save it in a variable.
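For readers skimming the thread, the behavior being fixed can be sketched outside Spark roughly as follows. `SimpleColumnStat`, `groupContribution`, and the sample stats are illustrative stand-ins, not Spark's actual `ColumnStat` or `AggregateEstimation` code:

```scala
// Illustrative stand-in for the relevant fields of Spark's ColumnStat.
case class SimpleColumnStat(distinctCount: Option[BigInt], nullCount: Option[BigInt])

// A group-by column that contains only nulls has distinctCount == 0, yet it
// still yields exactly one group, so it should contribute 1 (not 0) to the
// row-count product.
def groupContribution(stat: SimpleColumnStat): BigInt = {
  val distinctCount = stat.distinctCount.get // read once, per the review comment above
  if (distinctCount == 0 && stat.nullCount.get > 0) BigInt(1) else distinctCount
}

val allNullColumn = SimpleColumnStat(Some(BigInt(0)), Some(BigInt(2))) // only nulls
val normalColumn  = SimpleColumnStat(Some(BigInt(2)), Some(BigInt(0)))
```

With this adjustment, an all-null column contributes one group instead of zeroing out the whole product.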
Test build #4706 has finished for PR 24286 at commit
attilapiros left a comment
LGTM
+1, LGTM. Merged to master/2.4. Thank you for your first contribution, @pengbo.
…ion wit…

## What changes were proposed in this pull request?
The upper bound of the group-by output row count is the product of the distinct counts of the group-by columns. However, a column with only null values causes the estimated output row count to be 0, which is incorrect.

Ex:
col1 (distinct: 2, rowCount: 2)
col2 (distinct: 0, rowCount: 2)
=> group by col1, col2
Actual output rows: 0
Expected output rows: 2

## How was this patch tested?
A corresponding unit test has been added, and a manual test has been done in our TPC-DS benchmark environment.

Closes #24286 from pengbo/master.
Lead-authored-by: pengbo <[email protected]>
Co-authored-by: mingbo_pb <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit c58a4fe)
Signed-off-by: Dongjoon Hyun <[email protected]>
FYI, the PR title is incomplete.
    val distinctValue: BigInt = if (distinctCount == 0 && columnStat.nullCount.get > 0) {
      1
    } else {
      distinctCount
If the nullCount is not empty, the value should be distinctCount + 1, right?
Good point.
I will try and test it out. Another PR will be submitted if the problem exists.
Oh, right, @gatorsmile. Originally, this PR aims to fix the case of a column with only null values, but that's another case which we should fix. Thanks!
Oh, I inferred that the distinct count already correctly counted a null as a distinct value. If not, yeah, then distinctCount doesn't matter; it should add 1 iff nullCount is > 0. Agree, if this is broader, can we get an additional test of that case too?
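A hedged sketch of the follow-up being discussed, assuming (as concluded above) that the stored distinct count does not include null; the names here are illustrative, not Spark's actual code:

```scala
// Illustrative stand-in for the two statistics involved.
case class Stat(distinctCount: BigInt, nullCount: BigInt)

// Nulls form one extra group whenever any are present, whether or not the
// column also has non-null values; this subsumes the all-null special case.
def adjustedDistinct(s: Stat): BigInt =
  s.distinctCount + (if (s.nullCount > 0) BigInt(1) else BigInt(0))
```

For an all-null column (distinct 0, nulls > 0) this still yields 1, so it remains consistent with the fix in this PR while also covering the mixed null/non-null case raised here.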
Thanks for your reviews! A general comment about the code review: we should try our best to improve unit test coverage. This PR basically exposes one of the scenarios we missed before, so we should ask contributors to improve the other similar cases too.
…h column containing null values

## What changes were proposed in this pull request?
This PR is a follow-up of #24286. As gatorsmile pointed out, the estimation for a column containing null values is inaccurate as well.

```
> select key from test;
2
NULL
1
spark-sql> desc extended test key;
col_name key
data_type int
comment NULL
min 1
max 2
num_nulls 1
distinct_count 2
```

The effective distinct count should be distinct_count + 1 when the column contains null values.

## How was this patch tested?
Existing tests & a new UT added.

Closes #24436 from pengbo/aggregation_estimation.
Authored-by: pengbo <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d9b2ce0)
Signed-off-by: Dongjoon Hyun <[email protected]>
Hi, @HyukjinKwon.
Since we reverted this approach, we need to find another way to avoid
Yup, sorry. It was reverted for the reason in #24436 (comment).
@rxin, @gatorsmile, @cloud-fan, @HyukjinKwon. Sorry, but I'm wondering if that is the correct reason for the revert.
Do we really need to revert this to prevent a mismatch with
Sorry, I think I rushed to revert. It's 0 as the distinct count everywhere; see spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala line 270 in 239082d, and lines 50 to 51 in b1857a4.
It's a bit late in my timezone. I will reread it closely tomorrow in KST.
Thanks for confirming. I'll create reverting PRs for more reviews.
I am sorry, @dongjoon-hyun. It's my big mistake; I apologise for rushing. Let me open a PR to revert next time. Can you revert my revert right away, or open a PR to revert mine?
I will revert my revert.
Never mind~ Please go ahead! Thank you.
Sorry guys, it was my huge mistake. I will make sure we don't make such a mistake next time.
What changes were proposed in this pull request?
The upper bound of the group-by output row count is the product of the distinct counts of the group-by columns. However, a column with only null values causes the estimated output row count to be 0, which is incorrect.
Ex:
col1 (distinct: 2, rowCount 2)
col2 (distinct: 0, rowCount 2)
=> group by col1, col2
Actual: output rows: 0
Expected: output rows: 2
How was this patch tested?
A corresponding unit test has been added, and a manual test has been done in our TPC-DS benchmark environment.
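The example above can be reproduced with a toy version of the estimation. `estimateOutputRows` and its helpers are illustrative sketches of the idea, not Spark's actual API; they assume the output row bound is the product of per-column group contributions, capped at the child row count:

```scala
// Illustrative per-column statistics.
case class ColStat(distinctCount: BigInt, nullCount: BigInt)

// One group per distinct value, plus one for nulls if any are present; an
// all-null column therefore contributes 1 instead of collapsing the product to 0.
def contribution(s: ColStat): BigInt =
  s.distinctCount + (if (s.nullCount > 0) BigInt(1) else BigInt(0))

// Upper bound on group-by output rows: product of contributions, capped at
// the child's row count.
def estimateOutputRows(childRowCount: BigInt, cols: Seq[ColStat]): BigInt =
  cols.map(contribution).product.min(childRowCount)

val col1 = ColStat(BigInt(2), BigInt(0)) // distinct: 2, no nulls (assumed)
val col2 = ColStat(BigInt(0), BigInt(2)) // only nulls: distinct 0, but one group
```

Without the fix, col2 would contribute 0 and the estimate would be 0; with it, the estimate is min(2 * 1, 2) = 2, matching the expected output rows above.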