Conversation

@pengbo (Contributor) commented Apr 3, 2019

What changes were proposed in this pull request?

The upper bound of the group-by output row count is estimated as the product of the distinct counts of the group-by columns. However, a column containing only null values has a distinct count of 0, which incorrectly drives the estimated output row count to 0.
Ex:
col1 (distinct: 2, rowCount 2)
col2 (distinct: 0, rowCount 2)
=> group by col1, col2
Actual: output rows: 0
Expected: output rows: 2
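
To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual Spark estimator) of how multiplying distinct counts zeroes out the estimate when one group-by column is all null:

```scala
// Hypothetical stand-in for Spark's per-column statistics.
case class ColStat(distinctCount: BigInt, nullCount: BigInt)

// Upper-bound estimate: the product of the group-by columns' distinct
// counts, capped by the child plan's row count.
def estimateOutputRows(rowCount: BigInt, groupBy: Seq[ColStat]): BigInt =
  groupBy.map(_.distinctCount).product.min(rowCount)

// col1: 2 distinct values; col2: all null, so its distinctCount is 0.
val stats = Seq(ColStat(2, 0), ColStat(0, 2))
println(estimateOutputRows(2, stats)) // prints 0, but 2 rows are expected
```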

How was this patch tested?

A corresponding unit test has been added, and manual testing has been done in our TPC-DS benchmark environment.

@pengbo (Contributor, Author) commented Apr 3, 2019

@wzhfy @maropu @cloud-fan @dongjoon-hyun Can you please have a look?

@srowen (Member) commented Apr 11, 2019

To paraphrase for someone who doesn't know this code well: normally there's no way to have >0 rows but 0 distinct rows... except when the column is all null? distinct would return 0 rows? then a groupBy on that column still returns 1 grouping for that column?

@cloud-fan (Contributor) commented:

> except when the column is all null? distinct would return 0 rows?

Seems like so. Can we check the null count to make sure it's an all-null column?
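
A hedged sketch of the suggested check (hypothetical helper, not code from this PR): distinctCount == 0 alone is ambiguous, so pair it with the null count:

```scala
// A column is provably "all null" only when the stats report zero distinct
// values AND at least one null; distinctCount == 0 by itself could also
// come from an empty table.
def isAllNullColumn(distinctCount: BigInt, nullCount: BigInt): Boolean =
  distinctCount == 0 && nullCount > 0
```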

@pengbo (Contributor, Author) commented Apr 11, 2019

> when the column is all null? distinct would return 0 rows? then a groupBy on that column still returns 1 grouping for that column?

@srowen Thanks for your comment.
Yes, that's exactly what's happening. Currently, if one group-by column is all null, the group-by output row estimate is always 0. Please recheck the example I provided, and feel free to ask if you need more information.

@pengbo (Contributor, Author) commented Apr 11, 2019

> except when the column is all null? distinct would return 0 rows?
>
> Seems like so. Can we check the null count to make sure it's an all-null column?

You mean "all null" should be defined not just as a distinct count equal to 0, but also as a null count greater than 0?

@pengbo (Contributor, Author) commented Apr 11, 2019

@cloud-fan The null count check has been added as well; please recheck when you have a chance. Thanks.

```scala
childStats.attributeStats(expr.asInstanceOf[Attribute]).distinctCount.get)
(res, expr) => {
  val columnStat = childStats.attributeStats(expr.asInstanceOf[Attribute])
  val distinctValue: BigInt = if (columnStat.distinctCount.get == 0 &&
```
A Member commented:
I'm slightly concerned about performance here. Can you save the value of distinctCount.get so it isn't accessed twice, or is it not actually recomputed twice as-is?

@pengbo (Contributor, Author) commented Apr 14, 2019:

Done.
"childStats.distinctCount.get" is to get the value from the class variable/Option. The cost seems to be negligible, but it's better to save it by one variable.

@SparkQA commented Apr 14, 2019

Test build #4706 has finished for PR 24286 at commit 6a9d35f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a comment:

LGTM

@dongjoon-hyun (Member) commented Apr 15, 2019

+1, LGTM. Merged to master/2.4.

Thank you for your first contribution, @pengbo.
Also, thank you, @attilapiros and @srowen.

dongjoon-hyun pushed a commit that referenced this pull request Apr 15, 2019:
[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…

## What changes were proposed in this pull request?
The upper bound of the group-by output row count is estimated as the product of the distinct counts of the group-by columns. However, a column containing only null values has a distinct count of 0, which incorrectly drives the estimated output row count to 0.
Ex:
col1 (distinct: 2, rowCount 2)
col2 (distinct: 0, rowCount 2)
=> group by col1, col2
Actual: output rows: 0
Expected: output rows: 2

## How was this patch tested?
A corresponding unit test has been added, and manual testing has been done in our TPC-DS benchmark environment.

Closes #24286 from pengbo/master.

Lead-authored-by: pengbo <[email protected]>
Co-authored-by: mingbo_pb <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit c58a4fe)
Signed-off-by: Dongjoon Hyun <[email protected]>
@gatorsmile changed the title from "[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…" to "[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation with only null value column" on Apr 22, 2019
@gatorsmile (Member) commented:

FYI, the PR title is incomplete.

```scala
val distinctValue: BigInt = if (distinctCount == 0 && columnStat.nullCount.get > 0) {
  1
} else {
  distinctCount
}
```
@gatorsmile (Member) commented:

If the nullCount is not zero, the value should be distinctCount + 1, right?

@pengbo @dongjoon-hyun

@pengbo (Contributor, Author) commented:

Good point.
I will try and test it out. Another PR will be submitted if the problem exists.

A Member commented:

Oh, right, @gatorsmile. Originally, this PR aimed to fix the case of a column with only null values, but that is another case we should fix. Thanks!

A Member commented:

Oh, I had inferred that the distinct count already counted null as a distinct value. If not, then yeah, distinctCount doesn't matter; it should add 1 iff nullCount is > 0. Agreed; if this is broader, can we get an additional test for that case too?
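
A hedged sketch of the rule this thread converges on (hypothetical helper, not the follow-up's actual code): since the distinct count excludes nulls, a null adds one extra grouping value whenever nullCount > 0:

```scala
// NULL forms its own group under GROUP BY, so count it as one extra
// distinct grouping value when the column contains any nulls.
def effectiveGroupingValues(distinctCount: BigInt, nullCount: BigInt): BigInt =
  distinctCount + (if (nullCount > 0) BigInt(1) else BigInt(0))

println(effectiveGroupingValues(2, 1)) // 3: e.g. the values {1, 2, NULL}
println(effectiveGroupingValues(0, 2)) // 1: an all-null column still groups
```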

@gatorsmile (Member) commented:

Thanks for your reviews!

A general comment about the code review: we should try our best to improve the unit-test coverage. This PR basically exposes one of the scenarios we missed before, so we should ask contributors to improve the other similar cases too.

dongjoon-hyun pushed a commit that referenced this pull request Apr 23, 2019
…h column containing null values

## What changes were proposed in this pull request?
This PR is a follow-up of #24286. As gatorsmile pointed out, the estimate for a column containing null values is inaccurate as well.

```
> select key from test;
2
NULL
1
spark-sql> desc extended test key;
col_name key
data_type int
comment NULL
min 1
max 2
num_nulls 1
distinct_count 2
```

The distinct count used for estimation should be distinct_count + 1 when the column contains null values.

## How was this patch tested?

Existing tests & new UT added.

Closes #24436 from pengbo/aggregation_estimation.

Authored-by: pengbo <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d9b2ce0)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Apr 23, 2019:
…h column containing null values
@dongjoon-hyun (Member) commented:

Hi, @HyukjinKwon. Since you reverted this, could you comment here about the reason, too?

@dongjoon-hyun (Member) commented:

Since we reverted this approach, we need to find another way to avoid the `output rows: 0` situation.

@HyukjinKwon (Member) commented:

Yup, sorry. It was reverted for the reason given in #24436 (comment).

@dongjoon-hyun (Member) commented May 23, 2019

@rxin , @gatorsmile , @cloud-fan , @HyukjinKwon .

Sorry, but I'm wondering if that is a correct reason for the revert.

  • First, this PR doesn't affect the SQL COUNT(DISTINCT) result or the Dataset.stats result at all.
  • Second, this PR doesn't affect the per-column statistics stored in Spark tables, such as:

    spark.sql.statistics.colStats.a.distinctCount=1
    spark.sql.statistics.colStats.a.nullCount=1
    spark.sql.statistics.colStats.b.distinctCount=0
    spark.sql.statistics.colStats.b.nullCount=2

  • Although this might be exposed to users, it is a meaningful fix for the internal AggregateEstimation in Apache Spark's CBO.

Do we really need to revert this to prevent a mismatch with Pandas or other user-facing SQL output? These are internal statistics.

@HyukjinKwon (Member) commented:

Sorry, I think I rushed the revert. ColumnStat uses 0 as the distinct count everywhere for all-null columns, so I thought it was definitely a mistake because here we use it as 1.

```scala
// Quoted excerpts: statistics collection counts the non-null values, and an
// all-null column is recorded with distinctCount = 0 and nullCount = rowCount.
val numNonNulls = if (col.nullable) Count(col) else Count(one)

ColumnStat(distinctCount = Some(0), min = None, max = None, nullCount = Some(rowCount),
  avgLen = Some(dataType.defaultSize), maxLen = Some(dataType.defaultSize))
```

It's a bit late in my timezone; I will reread it closely tomorrow (KST).

@dongjoon-hyun (Member) commented:

Thanks for confirming. I'll create reverting PRs for more reviews.

@HyukjinKwon (Member) commented:

I am sorry, @dongjoon-hyun. It was my big mistake, and I apologise for rushing. Let me open a PR for reverts next time. Can you revert my revert right away, or open a PR to revert mine?
It shouldn't say the row count is 0, because nulls can be grouped as well.

@HyukjinKwon (Member) commented:

I will revert my revert.

@dongjoon-hyun (Member) commented:

Never mind~ Please go ahead! Thank you.

@HyukjinKwon (Member) commented:

Sorry, guys. It was my huge mistake. I will make sure we don't make such a mistake next time.

kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019:
[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019:
…h column containing null values
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019:
[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019:
…h column containing null values
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019:
[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019:
…h column containing null values
yoock pushed a commit to yoock/spark-apache that referenced this pull request Jan 14, 2020:
…h column containing null values