[SPARK-31549][PYSPARK] Add a developer API invoking collect on Python RDD with user-specified job group #28395
Conversation
Force-pushed from 822448c to 1ead01d.
Test build #121999 has finished for PR 28395 at commit
Review comment on core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala (outdated, resolved):
LGTM as a rather temporary workaround for `PYSPARK_PIN_THREAD`, and as a step to migrate to `PYSPARK_PIN_THREAD` smoothly. It targets Spark 3.0.
- `PYSPARK_PIN_THREAD` is unstable at this moment, which affects whole PySpark applications.
- It is impossible to make it a runtime configuration, as it has to be set before the JVM is launched.
- There is a thread leak issue between Python and the JVM. We should address it, but it's not a release blocker for Spark 3.0 since the feature is experimental. I plan to handle this after Spark 3.0 for stability reasons.

Once `PYSPARK_PIN_THREAD` is enabled by default, we should ideally remove this API. I will target deprecating this API in Spark 3.1.
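For context, a minimal sketch of why `PYSPARK_PIN_THREAD` cannot be a runtime configuration: the flag is read when the JVM is launched, so it has to be in the environment before the first `SparkContext` is created. The environment variable name is the real one; the surrounding script is an illustrative assumption.

```
import os

# PYSPARK_PIN_THREAD is read at JVM launch, so it must be set before the
# first SparkContext is created; it cannot be changed as a runtime conf.
os.environ["PYSPARK_PIN_THREAD"] = "true"

from pyspark import SparkContext

sc = SparkContext()

# In pinned thread mode, each Python thread maps to its own JVM thread, so
# sc.setJobGroup(...) applies to jobs submitted from the same Python thread.
sc.setJobGroup("my-group", "jobs from this thread")
```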
Test build #122023 has finished for PR 28395 at commit

retest this please

Test build #122034 has finished for PR 28395 at commit

retest this please

Test build #122047 has finished for PR 28395 at commit

retest this please

Test build #122059 has finished for PR 28395 at commit

@WeichenXu123 the test failure seems legitimate.

Test build #122095 has finished for PR 28395 at commit

retest this please

Test build #122100 has finished for PR 28395 at commit

Test build #122120 has finished for PR 28395 at commit

Merged to master and branch-3.0. Thanks @mengxr, @WeichenXu123 and @dongjoon-hyun.
[SPARK-31549][PYSPARK] Add a developer API invoking collect on Python RDD with user-specified job group
### What changes were proposed in this pull request?
I add a new API in the PySpark RDD class:

`def collectWithJobGroup(self, groupId, description, interruptOnCancel=False)`

This API does the same thing as `rdd.collect`, but it can specify the job group under which the collect job runs.
The purpose of adding this API is that, if we use:
```
sc.setJobGroup("group-id...")
rdd.collect()
```
the `setJobGroup` API in PySpark won't work correctly. This is related to a bug discussed in
https://issues.apache.org/jira/browse/SPARK-31549
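With the new API, the job group is attached to the collect call itself, so it is applied on the correct JVM thread. A minimal usage sketch (the RDD, group id, and description below are illustrative; `sc` is assumed to be a running `SparkContext`):

```
rdd = sc.parallelize(range(100))

# Equivalent to rdd.collect(), but the job runs under the given job group,
# so it can be identified (and cancelled) by that group id.
result = rdd.collectWithJobGroup(
    "group-id...",              # groupId, as shown in the Spark UI
    "example collect job",      # description for the job group
    interruptOnCancel=False,    # if True, cancellation interrupts executor threads
)
```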
Note:
This PR is a rather temporary workaround for `PYSPARK_PIN_THREAD`, and a step toward migrating to `PYSPARK_PIN_THREAD` smoothly. It targets Spark 3.0.
- `PYSPARK_PIN_THREAD` is unstable at this moment, which affects whole PySpark applications.
- It is impossible to make it a runtime configuration, as it has to be set before the JVM is launched.
- There is a thread leak issue between Python and the JVM. We should address it, but it's not a release blocker for Spark 3.0 since the feature is experimental. I plan to handle this after Spark 3.0 for stability reasons.

Once `PYSPARK_PIN_THREAD` is enabled by default, we should ideally remove this API. I will target deprecating this API in Spark 3.1.
### Why are the changes needed?
Fix the bug described above.
### Does this PR introduce any user-facing change?
A developer API in PySpark: `pyspark.RDD.collectWithJobGroup`
### How was this patch tested?
Unit test.
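For illustration, a hypothetical sketch of the kind of unit test that could cover this (not the actual test added in this PR; the fixture name and assertions are assumptions):

```
def test_collect_with_job_group(sc):
    """sc is assumed to be a SparkContext provided by the test harness."""
    rdd = sc.parallelize(range(10), 2)
    result = rdd.collectWithJobGroup("test-group", "collect with job group")
    assert sorted(result) == list(range(10))

    # The collect job should have been registered under the given group id.
    job_ids = sc.statusTracker().getJobIdsForGroup("test-group")
    assert len(job_ids) > 0
```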
Closes #28395 from WeichenXu123/collect_with_job_group.
Authored-by: Weichen Xu <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit ee1de66)
Signed-off-by: HyukjinKwon <[email protected]>
Hi, @HyukjinKwon.

Hm, weird. It was a clean backport. Let me make a fix in the master through branch-3.0 to reduce the diff. Seems it's legitimate anyway.

Thanks!

Ah, it's branch-3.0 only. Let me just hotfix in branch-3.0 only.

Thank you. The follow-up looks good. BTW, FYI,

Thanks for letting me know. I will take a look too.

Ah, it was already commented at #28194 :-)