Conversation

@pengbo (Contributor) commented Apr 3, 2019

What changes were proposed in this pull request?

The upper bound of the group-by output row count is estimated as the product of the distinct counts of the group-by columns. However, a column containing only null values has a distinct count of 0, which incorrectly drives the estimated output row count to 0.
Ex:
col1 (distinct: 2, rowCount 2)
col2 (distinct: 0, rowCount 2)
=> group by col1, col2
Actual: output rows: 0
Expected: output rows: 2
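
To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual Spark estimator) of how multiplying distinct counts zeroes out the estimate when one group-by column is all null:

```scala
// Hypothetical stand-in for Spark's per-column statistics.
case class ColStat(distinctCount: BigInt, nullCount: BigInt)

// Upper-bound estimate: the product of the group-by columns' distinct
// counts, capped by the child plan's row count.
def estimateOutputRows(rowCount: BigInt, groupBy: Seq[ColStat]): BigInt =
  groupBy.map(_.distinctCount).product.min(rowCount)

// col1: 2 distinct values; col2: all null, so its distinctCount is 0.
val stats = Seq(ColStat(2, 0), ColStat(0, 2))
println(estimateOutputRows(2, stats)) // prints 0, but 2 rows are expected
```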

How was this patch tested?

A corresponding unit test has been added, and manual testing has been done in our TPC-DS benchmark environment.

@pengbo (Contributor, Author) commented Apr 3, 2019

@wzhfy @maropu @cloud-fan @dongjoon-hyun Can you please have a look?

@srowen (Member) commented Apr 11, 2019

To paraphrase for someone who doesn't know this code well: normally there's no way to have >0 rows but 0 distinct rows... except when the column is all null? distinct would return 0 rows? then a groupBy on that column still returns 1 grouping for that column?

@cloud-fan (Contributor) commented:

> except when the column is all null? distinct would return 0 rows?

Seems like so. Can we check the null count to make sure it's an all-null column?
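
A hedged sketch of the suggested check (hypothetical helper, not code from this PR): distinctCount == 0 alone is ambiguous, so pair it with the null count:

```scala
// A column is provably "all null" only when the stats report zero distinct
// values AND at least one null; distinctCount == 0 by itself could also
// come from an empty table.
def isAllNullColumn(distinctCount: BigInt, nullCount: BigInt): Boolean =
  distinctCount == 0 && nullCount > 0
```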

@pengbo (Contributor, Author) commented Apr 11, 2019

> when the column is all null? distinct would return 0 rows? then a groupBy on that column still returns 1 grouping for that column?

@srowen Thanks for your comment.
Yes, that's exactly what's happening. Currently, if one group-by column is all null, the group-by output row estimate is always 0. Please recheck the example I provided, and feel free to ask if you need more information.

@pengbo (Contributor, Author) commented Apr 11, 2019

> except when the column is all null? distinct would return 0 rows?
>
> Seems like so. Can we check the null count to make sure it's an all-null column?

You mean "all null" should be defined not just as a distinct count equal to 0, but also as a null count greater than 0?

@pengbo (Contributor, Author) commented Apr 11, 2019

@cloud-fan The null count check has been added as well; please recheck when you have a chance. Thanks.

```scala
childStats.attributeStats(expr.asInstanceOf[Attribute]).distinctCount.get)
(res, expr) => {
  val columnStat = childStats.attributeStats(expr.asInstanceOf[Attribute])
  val distinctValue: BigInt = if (columnStat.distinctCount.get == 0 &&
```
A Member commented:
I'm slightly concerned about performance here. Can you save the value of distinctCount.get so it isn't accessed twice, or is it not actually recomputed twice as-is?

@pengbo (Contributor, Author) commented Apr 14, 2019:

Done.
"childStats.distinctCount.get" is to get the value from the class variable/Option. The cost seems to be negligible, but it's better to save it by one variable.

@SparkQA commented Apr 14, 2019

Test build #4706 has finished for PR 24286 at commit 6a9d35f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a comment:

LGTM

@dongjoon-hyun (Member) commented Apr 15, 2019

+1, LGTM. Merged to master/2.4.

Thank you for your first contribution, @pengbo.
Also, thank you, @attilapiros and @srowen.

dongjoon-hyun pushed a commit that referenced this pull request Apr 15, 2019:
[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…

## What changes were proposed in this pull request?
The upper bound of the group-by output row count is estimated as the product of the distinct counts of the group-by columns. However, a column containing only null values has a distinct count of 0, which incorrectly drives the estimated output row count to 0.
Ex:
col1 (distinct: 2, rowCount 2)
col2 (distinct: 0, rowCount 2)
=> group by col1, col2
Actual: output rows: 0
Expected: output rows: 2

## How was this patch tested?
A corresponding unit test has been added, and manual testing has been done in our TPC-DS benchmark environment.

Closes #24286 from pengbo/master.

Lead-authored-by: pengbo <[email protected]>
Co-authored-by: mingbo_pb <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit c58a4fe)
Signed-off-by: Dongjoon Hyun <[email protected]>
@gatorsmile changed the title from "[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…" to "[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation with only null value column" on Apr 22, 2019
@gatorsmile (Member) commented:

FYI, the PR title is incomplete.

```scala
val distinctValue: BigInt = if (distinctCount == 0 && columnStat.nullCount.get > 0) {
  1
} else {
  distinctCount
}
```
@gatorsmile (Member) commented:

If the nullCount is not zero, the value should be distinctCount + 1, right?

@pengbo @dongjoon-hyun

@pengbo (Contributor, Author) commented:

Good point.
I will try and test it out. Another PR will be submitted if the problem exists.

A Member commented:

Oh, right, @gatorsmile. Originally, this PR aimed to fix the case of a column with only null values, but that is another case we should fix. Thanks!

A Member commented:

Oh, I had inferred that the distinct count already counted null as a distinct value. If not, then yeah, distinctCount doesn't matter; it should add 1 iff nullCount is > 0. Agreed; if this is broader, can we get an additional test for that case too?
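
A hedged sketch of the rule this thread converges on (hypothetical helper, not the follow-up's actual code): since the distinct count excludes nulls, a null adds one extra grouping value whenever nullCount > 0:

```scala
// NULL forms its own group under GROUP BY, so count it as one extra
// distinct grouping value when the column contains any nulls.
def effectiveGroupingValues(distinctCount: BigInt, nullCount: BigInt): BigInt =
  distinctCount + (if (nullCount > 0) BigInt(1) else BigInt(0))

println(effectiveGroupingValues(2, 1)) // 3: e.g. the values {1, 2, NULL}
println(effectiveGroupingValues(0, 2)) // 1: an all-null column still groups
```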

@gatorsmile (Member) commented:

Thanks for your reviews!

A general comment about the code review: we should try our best to improve the unit-test coverage. This PR basically exposes one of the scenarios we missed before, so we should ask contributors to improve the other similar cases too.

dongjoon-hyun pushed a commit that referenced this pull request Apr 23, 2019
…h column containing null values

## What changes were proposed in this pull request?
This PR is a follow-up of #24286. As gatorsmile pointed out, the estimate for a column containing null values is inaccurate as well.

```
> select key from test;
2
NULL
1
spark-sql> desc extended test key;
col_name key
data_type int
comment NULL
min 1
max 2
num_nulls 1
distinct_count 2
```

The distinct count used for estimation should be distinct_count + 1 when the column contains null values.

## How was this patch tested?

Existing tests & new UT added.

Closes #24436 from pengbo/aggregation_estimation.

Authored-by: pengbo <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d9b2ce0)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Apr 23, 2019:
…h column containing null values
@dongjoon-hyun (Member) commented:

Hi, @HyukjinKwon. Since you reverted this, could you comment here about the reason, too?

@dongjoon-hyun (Member) commented:

Since we reverted this approach, we need to find another way to avoid the `output rows: 0` situation.

@HyukjinKwon (Member) commented:

Yup, sorry. It was reverted for the reason given in #24436 (comment).

@dongjoon-hyun (Member) commented May 23, 2019

@rxin , @gatorsmile , @cloud-fan , @HyukjinKwon .

Sorry, but I'm wondering if that is a correct reason for the revert.

  • First, this PR doesn't affect the SQL COUNT(DISTINCT) result or the Dataset.stats result at all.
  • Second, this PR doesn't affect the per-column statistics stored in Spark tables, such as:

    spark.sql.statistics.colStats.a.distinctCount=1
    spark.sql.statistics.colStats.a.nullCount=1
    spark.sql.statistics.colStats.b.distinctCount=0
    spark.sql.statistics.colStats.b.nullCount=2

  • Although this might be exposed to users, it is a meaningful fix for the internal AggregateEstimation in Apache Spark's CBO.

Do we really need to revert this to prevent a mismatch with Pandas or other user-facing SQL output? These are internal statistics.

@HyukjinKwon (Member) commented:

Sorry, I think I rushed the revert. ColumnStat uses 0 as the distinct count everywhere for all-null columns, so I thought it was definitely a mistake because here we use it as 1.

```scala
// Quoted excerpts: statistics collection counts the non-null values, and an
// all-null column is recorded with distinctCount = 0 and nullCount = rowCount.
val numNonNulls = if (col.nullable) Count(col) else Count(one)

ColumnStat(distinctCount = Some(0), min = None, max = None, nullCount = Some(rowCount),
  avgLen = Some(dataType.defaultSize), maxLen = Some(dataType.defaultSize))
```

It's a bit late in my timezone; I will reread it closely tomorrow (KST).

@dongjoon-hyun (Member) commented:

Thanks for confirming. I'll create reverting PRs for more reviews.

@HyukjinKwon (Member) commented:

I am sorry, @dongjoon-hyun. It was my big mistake, and I apologise for rushing. Let me open a PR for reverts next time. Can you revert my revert right away, or open a PR to revert mine?
It shouldn't say the row count is 0, because nulls can be grouped as well.

@HyukjinKwon (Member) commented:

I will revert my revert.

@dongjoon-hyun (Member) commented:

Never mind~ Please go ahead! Thank you.

@HyukjinKwon (Member) commented:

Sorry, guys. It was my huge mistake. I will make sure we don't make such a mistake next time.

kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019:
[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019:
…h column containing null values
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019:
[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019:
…h column containing null values
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019:
[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019:
…h column containing null values
yoock pushed a commit to yoock/spark-apache that referenced this pull request Jan 14, 2020:
…h column containing null values