Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

In #18064, to work around the nested SQL execution id issue, we introduced several internal methods in `Dataset`, like `collectInternal`, `countInternal`, `showInternal`, etc., to avoid nested execution ids.

However, this approach does not extend well: whenever we hit another nested execution id case, we may need to add yet more internal methods to `Dataset`.

Our goal is to ignore the nested execution id in some cases, and we can achieve it more cleanly by introducing `SQLExecution.ignoreNestedExecutionId`. Whenever we find a place that needs to ignore the nested execution, we can simply wrap the action with `SQLExecution.ignoreNestedExecutionId`, which is more extensible than the previous approach.

The idea comes from https://github.com/apache/spark/pull/17540/files#diff-ab49028253e599e6e74cc4f4dcb2e3a8R57 by @rdblue
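For illustration, a minimal sketch of how a command that triggers an inner action could use the helper (the command name and body below are hypothetical, not code from this PR):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.SQLExecution
import org.apache.spark.sql.execution.command.RunnableCommand

// Hypothetical command: it runs while an execution id is already set, so the
// inner `count()` would otherwise hit the nested-execution-id check.
case class CountTableCommand(tableName: String) extends RunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    val df = sparkSession.table(tableName)
    // Wrap the inner action so the already-set execution id is ignored
    // instead of triggering the "execution id is already set" error.
    val rowCount = SQLExecution.ignoreNestedExecutionId(sparkSession) {
      df.count()
    }
    Seq(Row(rowCount))
  }
}
```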

How was this patch tested?

existing tests.

@cloud-fan
Contributor Author

cloud-fan commented Jun 26, 2017

cc @rdblue @gatorsmile @viirya

/**
 * Wrap an action which may have nested execution id. This method can be used to run an execution
 * inside another execution, e.g., `CacheTableCommand` needs to call `Dataset.collect`.
 */
def ignoreNestedExecutionId[T](sparkSession: SparkSession)(body: => T): T = {
Member

Although we ignore the nested execution id, the job stages and metrics created by the `body` here will still be recorded into the `SQLExecutionUIData` referenced by the current execution id. But that looks like it should be fine.


/**
 * Wrap an action which may have nested execution id. This method can be used to run an execution
 * inside another execution, e.g., `CacheTableCommand` needs to call `Dataset.collect`.
Member

nit: all Spark jobs in the `body` won't be tracked in the UI.

@viirya
Member

viirya commented Jun 26, 2017

LGTM

// If `IGNORE_NESTED_EXECUTION_ID` is set, just ignore the execution id while evaluating the
// `body`, so that Spark jobs issued in the `body` won't be tracked.
try {
sc.setLocalProperty(EXECUTION_ID_KEY, null)
Contributor Author

@viirya now we won't track the Spark jobs even in `SparkListener`.

Member

Looks good.
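For context, the clear-and-restore pattern behind this can be sketched roughly as follows (the object and method names here are hypothetical; the real logic lives in `SQLExecution` and is driven by the `IGNORE_NESTED_EXECUTION_ID` flag):

```scala
import org.apache.spark.SparkContext

// Hypothetical standalone helper illustrating the pattern: clear the execution
// id so jobs issued by `body` are not attributed to any SQL execution, then
// restore the previous id so the enclosing execution keeps working.
object ExecutionIdSketch {
  val EXECUTION_ID_KEY = "spark.sql.execution.id" // same key SQLExecution uses

  def runWithoutExecutionId[T](sc: SparkContext)(body: => T): T = {
    val oldExecutionId = sc.getLocalProperty(EXECUTION_ID_KEY)
    try {
      sc.setLocalProperty(EXECUTION_ID_KEY, null) // inner jobs carry no execution id
      body
    } finally {
      sc.setLocalProperty(EXECUTION_ID_KEY, oldExecutionId) // restore the outer id
    }
  }
}
```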

@SparkQA

SparkQA commented Jun 26, 2017

Test build #78600 has finished for PR 18419 at commit 0795c16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Jun 26, 2017

LGTM

@SparkQA

SparkQA commented Jun 26, 2017

Test build #78608 has finished for PR 18419 at commit cd6e3f0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 26, 2017

Test build #78616 has finished for PR 18419 at commit cd6e3f0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 26, 2017

Test build #78634 has finished for PR 18419 at commit cd6e3f0.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

//
// A real case is the `DataFrame.count` method.
throw new IllegalArgumentException(s"$EXECUTION_ID_KEY is already set")
throw new IllegalArgumentException(s"$EXECUTION_ID_KEY is already set, please wrap your " +
Contributor

Nested execution is a developer problem, not a user problem. That's why the original PR did not throw IllegalArgumentException outside of testing. I think that should still be how this is handled.

If this is thrown at runtime, adding the text about ignoreNestedExecutionId is confusing for users, who can't (or shouldn't) set it. A comment is more appropriate if users will see this message. If the change to only throw during testing is added, then I think it is fine to add the text to the exception.

Contributor Author

`SQLExecution` is kind of a developer API; people who develop data sources may need to call `ignoreNestedExecutionId` inside their data source implementation, as reading/writing a data source runs inside a command and they may hit the nested execution problem. What do you think?
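For example, a third-party relation whose scan internally materializes another query might wrap that step like this (a hypothetical sketch, assuming the helper is reachable from data source code; not an API recommendation):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.execution.SQLExecution
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

// Hypothetical data source relation that derives its rows from another table.
class DerivedRelation(@transient val sqlContext: SQLContext, sourceTable: String)
    extends BaseRelation with TableScan {

  override def schema: StructType = sqlContext.table(sourceTable).schema

  override def buildScan(): RDD[Row] = {
    // buildScan may run inside a command that already set an execution id, so
    // ignore the nested id while the inner query issues its Spark jobs.
    val rows = SQLExecution.ignoreNestedExecutionId(sqlContext.sparkSession) {
      sqlContext.table(sourceTable).collect()
    }
    sqlContext.sparkContext.parallelize(rows.toSeq)
  }
}
```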

Contributor

The problem is that this is an easy error to hit and it shouldn't affect end users. It is better to warn that something is wrong than to fail a job that would otherwise succeed because of a bug in Spark. As for the error message, I think it is fine if we intend to leave it in. I'd just rather not fail user jobs here.

I assume that DataSource developers will have tests, but probably not ones that know to set spark.testing. Is there a better way to detect test cases?
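To make the trade-off concrete, a guard along these lines could fail fast only under the `spark.testing` convention and merely warn otherwise (a hypothetical sketch, not the code in this PR):

```scala
import org.apache.spark.SparkContext
import org.slf4j.LoggerFactory

object NestedExecutionGuardSketch {
  private val log = LoggerFactory.getLogger(getClass)
  val EXECUTION_ID_KEY = "spark.sql.execution.id"

  // Same convention Spark's own test harness uses to mark test runs.
  private def isTesting: Boolean =
    sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")

  def assertNoNestedExecution(sc: SparkContext): Unit = {
    if (sc.getLocalProperty(EXECUTION_ID_KEY) != null) {
      if (isTesting) {
        // Fail fast in tests so developers notice the nested execution.
        throw new IllegalArgumentException(s"$EXECUTION_ID_KEY is already set, please wrap your " +
          "action with SQLExecution.ignoreNestedExecutionId")
      } else {
        // At runtime, warn instead of failing a job that would otherwise succeed.
        log.warn(s"$EXECUTION_ID_KEY is already set; nested execution detected")
      }
    }
  }
}
```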

@rdblue
Contributor

rdblue commented Jun 26, 2017

One minor comment, otherwise +1.

@cloud-fan
Contributor Author

The test failure is unrelated; thanks for the review, merging to master!

@rdblue I'll address your comments in follow-up if you have any.

@asfgit asfgit closed this in c228100 Jun 26, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Jun 29, 2017
## What changes were proposed in this pull request?

This is kind of another follow-up for apache#18064.

In apache#18064, we wrap every SQL command with a SQL execution, which makes nested SQL execution very likely to happen. apache#18419 tried to improve it a little bit by introducing `SQLExecution.ignoreNestedExecutionId`. However, this is not friendly to data source developers, who may need to update their code to use this `ignoreNestedExecutionId` API.

This PR proposes a new solution: just allow nested execution. The downside is that we may have multiple executions for one query. We can improve this by updating the data organization in `SQLListener` to have a 1-n mapping from query to executions, instead of a 1-1 mapping. This can be done in a follow-up.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <[email protected]>

Closes apache#18450 from cloud-fan/execution-id.
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
…utionId

## What changes were proposed in this pull request?

In apache#18064, to work around the nested SQL execution id issue, we introduced several internal methods in `Dataset`, like `collectInternal`, `countInternal`, `showInternal`, etc., to avoid nested execution ids.

However, this approach does not extend well: whenever we hit another nested execution id case, we may need to add yet more internal methods to `Dataset`.

Our goal is to ignore the nested execution id in some cases, and we can achieve it more cleanly by introducing `SQLExecution.ignoreNestedExecutionId`. Whenever we find a place that needs to ignore the nested execution, we can simply wrap the action with `SQLExecution.ignoreNestedExecutionId`, which is more extensible than the previous approach.

The idea comes from https://github.com/apache/spark/pull/17540/files#diff-ab49028253e599e6e74cc4f4dcb2e3a8R57 by rdblue

## How was this patch tested?

existing tests.

Author: Wenchen Fan <[email protected]>

Closes apache#18419 from cloud-fan/follow.
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
## What changes were proposed in this pull request?

This is kind of another follow-up for apache#18064.

In apache#18064, we wrap every SQL command with a SQL execution, which makes nested SQL execution very likely to happen. apache#18419 tried to improve it a little bit by introducing `SQLExecution.ignoreNestedExecutionId`. However, this is not friendly to data source developers, who may need to update their code to use this `ignoreNestedExecutionId` API.

This PR proposes a new solution: just allow nested execution. The downside is that we may have multiple executions for one query. We can improve this by updating the data organization in `SQLListener` to have a 1-n mapping from query to executions, instead of a 1-1 mapping. This can be done in a follow-up.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <[email protected]>

Closes apache#18450 from cloud-fan/execution-id.