
Conversation

@AngersZhuuuu
Contributor

@AngersZhuuuu AngersZhuuuu commented Nov 13, 2019

What changes were proposed in this pull request?

The original way to judge whether a Dataset is empty is:

 def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }

This adds two shuffles, from limit() and groupBy().count(), and then collects the aggregated count to the driver.
Collecting only a single count avoids an OOM on the driver, but it triggers computation of all partitions and adds extra shuffle stages.

We change it to

  def isEmpty: Boolean = withAction("isEmpty", select().queryExecution) { plan =>
    plan.executeTake(1).isEmpty
  }

After this PR, we apply column pruning to the original LogicalPlan (via select() with no columns) and use the executeTake() API,
so no extra shuffle is added and only one partition's data is computed in the last stage.
This reduces the cost of calling Dataset.isEmpty() and still avoids memory issues on the driver side.
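
For reference, a minimal sketch (not part of the code change itself; it assumes an active SparkSession named `spark`) of the user-facing behavior and a quick way to eyeball the column pruning:

```scala
// Sketch only; assumes a running SparkSession called `spark`.
val empty = spark.range(0)
val nonEmpty = spark.range(10)

assert(empty.isEmpty)      // no rows: executeTake(1) returns an empty array
assert(!nonEmpty.isEmpty)  // the first fetched row is enough to answer false

// select() with no columns prunes the output, so the optimized plan
// carries an empty schema before executeTake(1) runs.
println(nonEmpty.select().queryExecution.optimizedPlan.schema)
```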

Why are the changes needed?

Optimize Dataset.isEmpty()

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests.

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Nov 13, 2019

@cloud-fan
As we discussed in #26437 (comment), is this way better?

```diff
- def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
-   plan.executeCollect().head.getLong(0) == 0
+ def isEmpty: Boolean = withAction("isEmpty", queryExecution) { plan =>
```
Contributor

shall we do column pruning?

Contributor Author

> shall we do column pruning?

Of course, I'll add it.

@cloud-fan
Contributor

Can we have some benchmark numbers?

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Nov 13, 2019

Test build #113700 has finished for PR 26500 at commit 56d2093.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2019

Test build #113711 has finished for PR 26500 at commit 908a39f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Nov 14, 2019

Can you add more detail to the PR description for better commit logs, e.g. how it is optimized?

@srowen
Member

srowen commented Nov 20, 2019

Ping @AngersZhuuuu

@AngersZhuuuu
Contributor Author

> Ping @AngersZhuuuu

Thanks for the ping, and sorry for leaving this work pending; I've been a little busy these days.
Starting on it now.

@AngersZhuuuu
Contributor Author

  test("benchmark of empty") {
    var start = System.currentTimeMillis()
    var isEmpty = spark.range(10000000)
      .repartition(100)
      .limit(1)
      .groupBy()
      .count()
      .queryExecution.executedPlan.executeCollect().head.getLong(0) == 0
    println(isEmpty)
    var end = System.currentTimeMillis()
    // scalastyle:off
    println(s"duration = ${end - start}")

    start = System.currentTimeMillis()
    isEmpty = spark.range(10000000)
      .repartition(100)
      .select()
      .queryExecution.executedPlan.executeTake(1) == 0
    println(isEmpty)
    end = System.currentTimeMillis()
    // scalastyle:off
    println(s"duration = ${end - start}")
  }

Result (old implementation first, durations in ms):
false
duration = 7248
false
duration = 1449

@cloud-fan @maropu @srowen
The test case is simple, but it mimics the behavior before and after the API change.

@cloud-fan
Contributor

Great! Can you enrich the PR description? "Optimize Dataset.isEmpty()" is fine for the "Why" section, but we need to put more in the "What" section, e.g. that we change the implementation to avoid shuffles.

@AngersZhuuuu
Contributor Author

> Great! Can you enrich the PR description? "Optimize Dataset.isEmpty()" is fine for the "Why" section, but we need to put more in the "What" section, e.g. that we change the implementation to avoid shuffles.

Updated. Is it clear now?

@cloud-fan
Contributor

> will add three shuffles by limit(), groupby() and count()

Have you confirmed? groupBy + count is one operator, called Aggregate.

@AngersZhuuuu
Contributor Author

> > will add three shuffles by limit(), groupby() and count()
>
> Have you confirmed? groupBy + count is one operator, called Aggregate.

Updated; count won't trigger an extra shuffle.
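
A quick way to double-check the shuffle count is to count the exchange nodes in the physical plan. This is just a sketch (it assumes a local `spark` session):

```scala
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

// Physical plan of the old isEmpty query: limit(1) followed by a global aggregate.
val oldPlan = spark.range(100).limit(1).groupBy().count()
  .queryExecution.executedPlan

// groupBy().count() compiles to a single Aggregate, so together with limit(1)
// we expect two ShuffleExchangeExec nodes here, not three.
val numShuffles = oldPlan.collect { case e: ShuffleExchangeExec => e }.size
println(s"shuffles in old isEmpty plan: $numShuffles")
```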

@cloud-fan cloud-fan closed this in 6146dc4 Nov 21, 2019
@cloud-fan
Contributor

thanks, merging to master!

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM too

rahij pushed a commit to palantir/spark that referenced this pull request Jan 10, 2020
bulldozer-bot bot pushed a commit to palantir/spark that referenced this pull request Jan 14, 2020
