[SPARK-20213][SQL][follow-up] introduce SQLExecution.ignoreNestedExecutionId #18419
Conversation
 * Wrap an action which may have nested execution id. This method can be used to run an execution
 * inside another execution, e.g., `CacheTableCommand` need to call `Dataset.collect`.
 */
def ignoreNestedExecutionId[T](sparkSession: SparkSession)(body: => T): T = {
Although we ignore the nested execution id, the job stages and metrics created by the `body` here will still be recorded into the `SQLExecutionUIData` referred to by the current execution id. But it looks like that should be fine.
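For context, here is a minimal usage sketch based only on the signature quoted above; the command name and table name are hypothetical, and it assumes the code sits in Spark's own `org.apache.spark.sql` code base so that the internal `SQLExecution` object is reachable:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SQLExecution

// Hypothetical command body: it already runs under an execution id, but still
// needs to trigger another Dataset action (similar to `CacheTableCommand`
// calling `Dataset.collect`).
def warmUpCache(sparkSession: SparkSession, tableName: String): Unit = {
  SQLExecution.ignoreNestedExecutionId(sparkSession) {
    // Jobs triggered here run with the execution id cleared, so they don't
    // collide with the outer command's execution.
    sparkSession.table(tableName).collect()
  }
}
```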
/**
 * Wrap an action which may have nested execution id. This method can be used to run an execution
 * inside another execution, e.g., `CacheTableCommand` need to call `Dataset.collect`.
nit: All Spark jobs in the `body` won't be tracked in the UI.
LGTM
// If `IGNORE_NESTED_EXECUTION_ID` is set, just ignore the execution id while evaluating the
// `body`, so that Spark jobs issued in the `body` won't be tracked.
try {
  sc.setLocalProperty(EXECUTION_ID_KEY, null)
@viirya now we won't track the Spark jobs even in SparkListener.
Looks good.
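For readers following along, a rough sketch of the clear-and-restore pattern being discussed; this is not the exact Spark source (the real code also consults the `IGNORE_NESTED_EXECUTION_ID` flag mentioned in the comment above), just an illustration of why jobs issued in `body` carry no execution id:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object NestedExecutionSketch {
  // Same key as SQLExecution.EXECUTION_ID_KEY.
  val EXECUTION_ID_KEY = "spark.sql.execution.id"

  // Clear the execution id while evaluating `body`, then restore it. Jobs
  // submitted by `body` carry no execution id, so neither the SQL UI nor a
  // SparkListener-based SQL tab can associate them with the outer query.
  def ignoreNestedExecutionId[T](sparkSession: SparkSession)(body: => T): T = {
    val sc: SparkContext = sparkSession.sparkContext
    val oldExecutionId = sc.getLocalProperty(EXECUTION_ID_KEY)
    try {
      sc.setLocalProperty(EXECUTION_ID_KEY, null)
      body
    } finally {
      sc.setLocalProperty(EXECUTION_ID_KEY, oldExecutionId)
    }
  }
}
```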
Test build #78600 has finished for PR 18419 at commit
LGTM
Test build #78608 has finished for PR 18419 at commit
retest this please
Test build #78616 has finished for PR 18419 at commit
retest this please
Test build #78634 has finished for PR 18419 at commit
     //
     // A real case is the `DataFrame.count` method.
-    throw new IllegalArgumentException(s"$EXECUTION_ID_KEY is already set")
+    throw new IllegalArgumentException(s"$EXECUTION_ID_KEY is already set, please wrap your " +
Nested execution is a developer problem, not a user problem. That's why the original PR did not throw IllegalArgumentException outside of testing. I think that should still be how this is handled.
If this is thrown at runtime, adding the text about ignoreNestedExecutionId is confusing for users, who can't (or shouldn't) set it. A comment is more appropriate if users will see this message. If the change to only throw during testing is added, then I think it is fine to add the text to the exception.
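As a hedged sketch of the behavior suggested here, not the actual Spark code: fail fast when `spark.testing` is set, but only log a warning otherwise. The message text is paraphrased from the truncated diff line above, and the object and method names are invented for the example:

```scala
import org.slf4j.LoggerFactory
import org.apache.spark.sql.SparkSession

object NestedExecutionCheckSketch {
  private val logger = LoggerFactory.getLogger(getClass)
  val EXECUTION_ID_KEY = "spark.sql.execution.id"

  // Fail fast in tests, but only warn in production, so a user job is not
  // failed because of a nested-execution bug inside Spark itself.
  def checkNoNestedExecution(sparkSession: SparkSession): Unit = {
    val existing = sparkSession.sparkContext.getLocalProperty(EXECUTION_ID_KEY)
    if (existing != null) {
      // Message paraphrased from the diff; the exact wording is cut off above.
      val msg = s"$EXECUTION_ID_KEY is already set, please wrap your action with " +
        "SQLExecution.ignoreNestedExecutionId"
      if (sys.props.contains("spark.testing")) {
        throw new IllegalArgumentException(msg)
      } else {
        logger.warn(msg)
      }
    }
  }
}
```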
SQLExecution is kind of a developer API; people who develop data sources may need to call ignoreNestedExecutionId inside their data source implementation, as reading/writing a data source will be run inside a command and they may hit the nested execution problem. What do you think?
The problem is that this is an easy error to hit and it shouldn't affect end users. It is better to warn that something is wrong than to fail a job that would otherwise succeed because of a bug in Spark. As for the error message, I think it is fine if we intend to leave it in. I'd just rather not fail user jobs here.
I assume that DataSource developers will have tests, but probably not ones that know to set spark.testing. Is there a better way to detect test cases?
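To make the data-source scenario above concrete, a hypothetical sketch; the connector class and method are invented, and it assumes the connector code can reach Spark's internal `SQLExecution` object:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.SQLExecution

// A made-up write path that runs inside a command (so an execution id is
// already set on the thread's local properties) but still needs to trigger
// another Dataset action.
class AuditedWriter {
  def write(df: DataFrame): Unit = {
    val rowCount = SQLExecution.ignoreNestedExecutionId(df.sparkSession) {
      // Without the wrapper this `count()` would start a nested execution
      // while the outer command's execution id is still set.
      df.count()
    }
    // ... write `df` out and record `rowCount` in an audit log ...
  }
}
```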
One minor comment, otherwise +1.
The test failure is unrelated. Thanks for the review, merging to master! @rdblue I'll address your comments in a follow-up if you have any.
## What changes were proposed in this pull request?

This is kind of another follow-up for apache#18064. In apache#18064, we wrap every SQL command with SQL execution, which makes nested SQL execution very likely to happen. apache#18419 tried to improve it a little bit, by introducing `SQLExecution.ignoreNestedExecutionId`. However, this is not friendly to data source developers; they may need to update their code to use this `ignoreNestedExecutionId` API.

This PR proposes a new solution: just allow nested execution. The downside is that we may have multiple executions for one query. We can improve this by updating the data organization in SQLListener, to have a 1-n mapping from query to execution instead of a 1-1 mapping. This can be done in a follow-up.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <[email protected]>

Closes apache#18450 from cloud-fan/execution-id.
[SPARK-20213][SQL][follow-up] introduce SQLExecution.ignoreNestedExecutionId

## What changes were proposed in this pull request?

In apache#18064, to work around the nested SQL execution id issue, we introduced several internal methods in `Dataset`, like `collectInternal`, `countInternal`, `showInternal`, etc., to avoid nested execution ids. However, this approach has poor extensibility. When we hit other nested execution id cases, we may need to add more internal methods in `Dataset`.

Our goal is to ignore the nested execution id in some cases, and we can achieve this with a better approach, by introducing `SQLExecution.ignoreNestedExecutionId`. Whenever we find a place that needs to ignore the nested execution, we can just wrap the action with `SQLExecution.ignoreNestedExecutionId`, which is more extensible than the previous approach.

The idea comes from https://github.com/apache/spark/pull/17540/files#diff-ab49028253e599e6e74cc4f4dcb2e3a8R57 by rdblue

## How was this patch tested?

existing tests.

Author: Wenchen Fan <[email protected]>

Closes apache#18419 from cloud-fan/follow.
What changes were proposed in this pull request?
In #18064, to work around the nested SQL execution id issue, we introduced several internal methods in `Dataset`, like `collectInternal`, `countInternal`, `showInternal`, etc., to avoid nested execution ids. However, this approach has poor extensibility: when we hit other nested execution id cases, we may need to add more internal methods in `Dataset`.

Our goal is to ignore the nested execution id in some cases, and we can achieve this with a better approach by introducing `SQLExecution.ignoreNestedExecutionId`. Whenever we find a place that needs to ignore the nested execution, we can just wrap the action with `SQLExecution.ignoreNestedExecutionId`, which is more extensible than the previous approach.

The idea comes from https://github.com/apache/spark/pull/17540/files#diff-ab49028253e599e6e74cc4f4dcb2e3a8R57 by @rdblue
How was this patch tested?
existing tests.
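As a hedged illustration of the change in calling convention described above: the enclosing object and method are invented, and only `SQLExecution.ignoreNestedExecutionId` and the old `collectInternal`-style helpers come from this PR's description:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.execution.SQLExecution

object NestedActionExample {
  // Before this PR, an internal call site would reach for a dedicated helper
  // such as `df.collectInternal()` to sidestep the nested execution id check.
  // With this PR it wraps the ordinary public action instead:
  def collectInsideCommand(sparkSession: SparkSession, df: DataFrame): Array[Row] = {
    SQLExecution.ignoreNestedExecutionId(sparkSession) {
      df.collect()
    }
  }
}
```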