[SPARK-20213][SQL][follow-up] introduce SQLExecution.ignoreNestedExecutionId #18419
Conversation
 * Wrap an action which may have nested execution id. This method can be used to run an execution
 * inside another execution, e.g., `CacheTableCommand` need to call `Dataset.collect`.
 */
def ignoreNestedExecutionId[T](sparkSession: SparkSession)(body: => T): T = {
Although we ignore the nested execution id, the job stages and metrics created by the `body` here will still be recorded into the `SQLExecutionUIData` referred to by the current execution id. But it looks like that should be fine.
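For context, here is a minimal usage sketch based only on the signature quoted above; the command name and table name are hypothetical, and it assumes the code sits in Spark's own `org.apache.spark.sql` code base so that the internal `SQLExecution` object is reachable:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SQLExecution

// Hypothetical command body: it already runs under an execution id, but still
// needs to trigger another Dataset action (similar to `CacheTableCommand`
// calling `Dataset.collect`).
def warmUpCache(sparkSession: SparkSession, tableName: String): Unit = {
  SQLExecution.ignoreNestedExecutionId(sparkSession) {
    // Jobs triggered here run with the execution id cleared, so they don't
    // collide with the outer command's execution.
    sparkSession.table(tableName).collect()
  }
}
```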
/**
 * Wrap an action which may have nested execution id. This method can be used to run an execution
 * inside another execution, e.g., `CacheTableCommand` need to call `Dataset.collect`.
nit: All Spark jobs in the `body` won't be tracked in the UI.
LGTM
// If `IGNORE_NESTED_EXECUTION_ID` is set, just ignore the execution id while evaluating the
// `body`, so that Spark jobs issued in the `body` won't be tracked.
try {
  sc.setLocalProperty(EXECUTION_ID_KEY, null)
@viirya now we won't track the Spark jobs even in SparkListener.
Looks good.
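For readers following along, a rough sketch of the clear-and-restore pattern being discussed; this is not the exact Spark source (the real code also consults the `IGNORE_NESTED_EXECUTION_ID` flag mentioned in the comment above), just an illustration of why jobs issued in `body` carry no execution id:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object NestedExecutionSketch {
  // Same key as SQLExecution.EXECUTION_ID_KEY.
  val EXECUTION_ID_KEY = "spark.sql.execution.id"

  // Clear the execution id while evaluating `body`, then restore it. Jobs
  // submitted by `body` carry no execution id, so neither the SQL UI nor a
  // SparkListener-based SQL tab can associate them with the outer query.
  def ignoreNestedExecutionId[T](sparkSession: SparkSession)(body: => T): T = {
    val sc: SparkContext = sparkSession.sparkContext
    val oldExecutionId = sc.getLocalProperty(EXECUTION_ID_KEY)
    try {
      sc.setLocalProperty(EXECUTION_ID_KEY, null)
      body
    } finally {
      sc.setLocalProperty(EXECUTION_ID_KEY, oldExecutionId)
    }
  }
}
```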
Test build #78600 has finished for PR 18419 at commit
LGTM
Test build #78608 has finished for PR 18419 at commit
retest this please
Test build #78616 has finished for PR 18419 at commit
retest this please
Test build #78634 has finished for PR 18419 at commit
     //
     // A real case is the `DataFrame.count` method.
-    throw new IllegalArgumentException(s"$EXECUTION_ID_KEY is already set")
+    throw new IllegalArgumentException(s"$EXECUTION_ID_KEY is already set, please wrap your " +
Nested execution is a developer problem, not a user problem. That's why the original PR did not throw IllegalArgumentException outside of testing. I think that should still be how this is handled.
If this is thrown at runtime, adding the text about ignoreNestedExecutionId is confusing for users, who can't (or shouldn't) set it. A comment is more appropriate if users will see this message. If the change to only throw during testing is added, then I think it is fine to add the text to the exception.
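As a hedged sketch of the behavior suggested here, not the actual Spark code: fail fast when `spark.testing` is set, but only log a warning otherwise. The message text is paraphrased from the truncated diff line above, and the object and method names are invented for the example:

```scala
import org.slf4j.LoggerFactory
import org.apache.spark.sql.SparkSession

object NestedExecutionCheckSketch {
  private val logger = LoggerFactory.getLogger(getClass)
  val EXECUTION_ID_KEY = "spark.sql.execution.id"

  // Fail fast in tests, but only warn in production, so a user job is not
  // failed because of a nested-execution bug inside Spark itself.
  def checkNoNestedExecution(sparkSession: SparkSession): Unit = {
    val existing = sparkSession.sparkContext.getLocalProperty(EXECUTION_ID_KEY)
    if (existing != null) {
      // Message paraphrased from the diff; the exact wording is cut off above.
      val msg = s"$EXECUTION_ID_KEY is already set, please wrap your action with " +
        "SQLExecution.ignoreNestedExecutionId"
      if (sys.props.contains("spark.testing")) {
        throw new IllegalArgumentException(msg)
      } else {
        logger.warn(msg)
      }
    }
  }
}
```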
SQLExecution is kind of a developer API; people who develop data sources may need to call ignoreNestedExecutionId inside their data source implementation, as reading/writing a data source will be run inside a command and they may hit the nested execution problem. What do you think?
The problem is that this is an easy error to hit and it shouldn't affect end users. It is better to warn that something is wrong than to fail a job that would otherwise succeed because of a bug in Spark. As for the error message, I think it is fine if we intend to leave it in. I'd just rather not fail user jobs here.
I assume that DataSource developers will have tests, but probably not ones that know to set spark.testing. Is there a better way to detect test cases?
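To make the data-source scenario above concrete, a hypothetical sketch; the connector class and method are invented, and it assumes the connector code can reach Spark's internal `SQLExecution` object:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.SQLExecution

// A made-up write path that runs inside a command (so an execution id is
// already set on the thread's local properties) but still needs to trigger
// another Dataset action.
class AuditedWriter {
  def write(df: DataFrame): Unit = {
    val rowCount = SQLExecution.ignoreNestedExecutionId(df.sparkSession) {
      // Without the wrapper this `count()` would start a nested execution
      // while the outer command's execution id is still set.
      df.count()
    }
    // ... write `df` out and record `rowCount` in an audit log ...
  }
}
```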
One minor comment, otherwise +1.
The test failure is unrelated. Thanks for the review, merging to master! @rdblue I'll address your comments in a follow-up if you have any.
## What changes were proposed in this pull request?

This is kind of another follow-up for apache#18064. In apache#18064, we wrap every SQL command with SQL execution, which makes nested SQL execution very likely to happen. apache#18419 tried to improve it a little bit, by introducing `SQLExecution.ignoreNestedExecutionId`. However, this is not friendly to data source developers; they may need to update their code to use this `ignoreNestedExecutionId` API.

This PR proposes a new solution: just allow nested execution. The downside is that we may have multiple executions for one query. We can improve this by updating the data organization in SQLListener, to have a 1-n mapping from query to execution instead of a 1-1 mapping. This can be done in a follow-up.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <[email protected]>

Closes apache#18450 from cloud-fan/execution-id.
[SPARK-20213][SQL][follow-up] introduce SQLExecution.ignoreNestedExecutionId

## What changes were proposed in this pull request?

In apache#18064, to work around the nested SQL execution id issue, we introduced several internal methods in `Dataset`, like `collectInternal`, `countInternal`, `showInternal`, etc., to avoid nested execution ids. However, this approach has poor extensibility. When we hit other nested execution id cases, we may need to add more internal methods in `Dataset`.

Our goal is to ignore the nested execution id in some cases, and we can achieve this with a better approach, by introducing `SQLExecution.ignoreNestedExecutionId`. Whenever we find a place that needs to ignore the nested execution, we can just wrap the action with `SQLExecution.ignoreNestedExecutionId`, which is more extensible than the previous approach.

The idea comes from https://github.com/apache/spark/pull/17540/files#diff-ab49028253e599e6e74cc4f4dcb2e3a8R57 by rdblue

## How was this patch tested?

existing tests.

Author: Wenchen Fan <[email protected]>

Closes apache#18419 from cloud-fan/follow.
What changes were proposed in this pull request?
In #18064, to work around the nested SQL execution id issue, we introduced several internal methods in `Dataset`, like `collectInternal`, `countInternal`, `showInternal`, etc., to avoid nested execution ids. However, this approach has poor extensibility: when we hit other nested execution id cases, we may need to add more internal methods in `Dataset`.

Our goal is to ignore the nested execution id in some cases, and we can achieve this with a better approach by introducing `SQLExecution.ignoreNestedExecutionId`. Whenever we find a place that needs to ignore the nested execution, we can just wrap the action with `SQLExecution.ignoreNestedExecutionId`, which is more extensible than the previous approach.

The idea comes from https://github.com/apache/spark/pull/17540/files#diff-ab49028253e599e6e74cc4f4dcb2e3a8R57 by @rdblue
How was this patch tested?
existing tests.
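As a hedged illustration of the change in calling convention described above: the enclosing object and method are invented, and only `SQLExecution.ignoreNestedExecutionId` and the old `collectInternal`-style helpers come from this PR's description:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.execution.SQLExecution

object NestedActionExample {
  // Before this PR, an internal call site would reach for a dedicated helper
  // such as `df.collectInternal()` to sidestep the nested execution id check.
  // With this PR it wraps the ordinary public action instead:
  def collectInsideCommand(sparkSession: SparkSession, df: DataFrame): Array[Row] = {
    SQLExecution.ignoreNestedExecutionId(sparkSession) {
      df.collect()
    }
  }
}
```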