[SPARK-29874][SQL]Optimize Dataset.isEmpty() #26500
@cloud-fan
*/
def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
def isEmpty: Boolean = withAction("isEmpty", queryExecution) { plan =>
shall we do column pruning?
> shall we do column pruning?

Of course, I'll add it.
Can we have some benchmark numbers?
ok to test
Test build #113700 has finished for PR 26500 at commit
Test build #113711 has finished for PR 26500 at commit
Can you address more in the PR desc for better commit logs, how to optimize it?
Ping @AngersZhuuuu
Thank you for the ping, and sorry for leaving this work pending. I have been a little busy these days.
@cloud-fan @maropu @srowen
great! can you enrich the PR description?
Updated. Is it clear now?
have you confirmed? groupby + count is one operator called Aggregate.
Updated,
thanks, merging to master!
HyukjinKwon
left a comment
LGTM too
### What changes were proposed in this pull request?
The original way to check whether a Dataset is empty was
```
def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
```
This adds two shuffles, one for `limit()` and one for `groupBy().count()`, and then collects the result to the driver.
Because only the count is collected, this avoids an OOM on the driver, but it triggers computation of all partitions and adds extra shuffle stages.
We change it to:
```
def isEmpty: Boolean = withAction("isEmpty", select().queryExecution) { plan =>
  plan.executeTake(1).isEmpty
}
```
After this PR, we add column pruning to the original LogicalPlan and use the `executeTake()` API.
No extra shuffle is added, and only one partition's data is computed in the last stage.
This reduces the cost of calling `Dataset.isEmpty()` and does not cause memory issues on the driver side.
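
For readers who want to compare the two strategies outside Spark's internals, here is a rough sketch that uses only the public `Dataset` API. It assumes a local `SparkSession`; the object `IsEmptySketch` and the helpers `isEmptyViaCount` and `isEmptyViaTake` are hypothetical names for illustration and are not part of Spark. The actual change goes through `withAction` and `executeTake()`, which this sketch only approximates with `take(1)`.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object IsEmptySketch {
  // Approximates the old strategy: limit to one row, run a global
  // aggregate, and collect the single count row to the driver.
  def isEmptyViaCount(ds: Dataset[_]): Boolean =
    ds.limit(1).groupBy().count().collect().head.getLong(0) == 0

  // Approximates the new strategy: prune all columns with select(),
  // then ask for at most one row.
  def isEmptyViaTake(ds: Dataset[_]): Boolean =
    ds.select().take(1).isEmpty

  def main(args: Array[String]): Unit = {
    // Assumption: a local session with two worker threads.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("isEmpty-sketch")
      .getOrCreate()
    try {
      val nonEmpty = spark.range(0, 1000000, 1, 8) // 8 partitions
      val empty = nonEmpty.filter("id < 0")        // no row matches

      // Both strategies must agree on the answer.
      assert(!isEmptyViaCount(nonEmpty))
      assert(!isEmptyViaTake(nonEmpty))
      assert(isEmptyViaCount(empty))
      assert(isEmptyViaTake(empty))

      // Print the physical plans to compare the exchanges involved.
      nonEmpty.limit(1).groupBy().count().explain()
      nonEmpty.select().limit(1).explain()
    } finally {
      spark.stop()
    }
  }
}
```

Comparing the two `explain()` outputs is one way to check the shuffle claim above for a given Spark version, since the exact physical plan can differ between releases.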
### Why are the changes needed?
Optimize Dataset.isEmpty()
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit tests.
Closes apache#26500 from AngersZhuuuu/SPARK-29874.
Authored-by: angerszhu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>