
Conversation

@WeichenXu123 (Contributor) commented Apr 28, 2020

What changes were proposed in this pull request?

I added a new API to the PySpark `RDD` class:

```python
def collectWithJobGroup(self, groupId, description, interruptOnCancel=False)
```

This API does the same thing as `rdd.collect`, but it lets the caller specify the job group for the collect job.
The purpose of adding this API is that, if we use:

```python
sc.setJobGroup("group-id...")
rdd.collect()
```

the `setJobGroup` API in PySpark won't work correctly. This is related to a bug discussed in
https://issues.apache.org/jira/browse/SPARK-31549
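To make the failure mode concrete, here is a minimal sketch (assuming a local SparkContext; the group ids are made up) of the multi-threaded pattern where the job group can end up mismatched with the Python thread that set it:

```python
import threading
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100))

def worker(group_id):
    # setJobGroup stores the group in a JVM thread-local. Without
    # PYSPARK_PIN_THREAD, py4j may service this call and the collect()
    # below on different JVM threads, so the group may be dropped or
    # attached to another thread's job.
    sc.setJobGroup(group_id, "demo job group")
    rdd.collect()

threads = [threading.Thread(target=worker, args=("group-%d" % i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```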

Note:

This PR is a rather temporary workaround for `PYSPARK_PIN_THREAD`, and a step toward migrating to `PYSPARK_PIN_THREAD` smoothly. It targets Spark 3.0.

  • `PYSPARK_PIN_THREAD` is unstable at the moment, and it affects whole PySpark applications.
  • It is impossible to make it a runtime configuration, as it has to be set before the JVM is launched.
  • There is a thread leak issue between Python and the JVM. We should address it, but it's not a release blocker for Spark 3.0 since the feature is experimental. I plan to handle it after Spark 3.0 for stability reasons.

Once `PYSPARK_PIN_THREAD` is enabled by default, we should ideally remove this API. I plan to deprecate it in Spark 3.1.

Why are the changes needed?

It fixes the `setJobGroup` bug described above.

Does this PR introduce any user-facing change?

Yes, a developer API in PySpark: `pyspark.RDD.collectWithJobGroup`.
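A hedged usage sketch (the variable names and group id are illustrative; only `collectWithJobGroup` itself comes from this PR):

```python
# Collect with an explicit job group instead of relying on
# sc.setJobGroup's thread-local state.
result = rdd.collectWithJobGroup("my-group", "collect under my-group",
                                 interruptOnCancel=True)
```

Another thread can then cancel the running job via the existing `sc.cancelJobGroup("my-group")`; with `interruptOnCancel=True`, cancellation also interrupts the executor threads running the job.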

How was this patch tested?

Unit test.

@WeichenXu123 force-pushed the collect_with_job_group branch from 822448c to 1ead01d on April 28, 2020 13:31
@SparkQA commented Apr 28, 2020

Test build #121999 has finished for PR 28395 at commit 1ead01d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr requested a review from HyukjinKwon on April 28, 2020 19:59
@HyukjinKwon (Member) left a comment

LGTM as a rather temporary workaround for `PYSPARK_PIN_THREAD`, and a step toward migrating to `PYSPARK_PIN_THREAD` smoothly. It targets Spark 3.0.

  • `PYSPARK_PIN_THREAD` is unstable at the moment, and it affects whole PySpark applications.
  • It is impossible to make it a runtime configuration, as it has to be set before the JVM is launched.
  • There is a thread leak issue between Python and the JVM. We should address it, but it's not a release blocker for Spark 3.0 since the feature is experimental. I plan to handle it after Spark 3.0 for stability reasons.

Once `PYSPARK_PIN_THREAD` is enabled by default, we should ideally remove this API. I plan to deprecate it in Spark 3.1.

@SparkQA commented Apr 29, 2020

Test build #122023 has finished for PR 28395 at commit 481bba6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Apr 29, 2020

Test build #122034 has finished for PR 28395 at commit 481bba6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Apr 29, 2020

Test build #122047 has finished for PR 28395 at commit 481bba6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Apr 29, 2020

Test build #122059 has finished for PR 28395 at commit 481bba6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

@WeichenXu123 the test failure seems legitimate.

@SparkQA commented Apr 30, 2020

Test build #122095 has finished for PR 28395 at commit 91cf1c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Apr 30, 2020

Test build #122100 has finished for PR 28395 at commit 91cf1c6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 30, 2020

Test build #122120 has finished for PR 28395 at commit bdd77fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master and branch-3.0. Thanks @mengxr, @WeichenXu123 and @dongjoon-hyun.

HyukjinKwon pushed a commit that referenced this pull request on May 1, 2020
…DD with user-specified job group

Closes #28395 from WeichenXu123/collect_with_job_group.

Authored-by: Weichen Xu <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit ee1de66)
Signed-off-by: HyukjinKwon <[email protected]>
@dongjoon-hyun (Member) commented May 1, 2020

Hi, @HyukjinKwon .
Could you fix the Python linter error on branch-3.0?

./python/pyspark/tests/test_rdd.py:787:5: E303 too many blank lines (2)
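For reference, E303 is pycodestyle's check for runs of consecutive blank lines above the configured maximum; a hypothetical illustration (not the actual contents of test_rdd.py):

```python
class RDDTests:            # hypothetical stand-in for the real test class
    def test_a(self):
        pass


    def test_b(self):      # E303: the two blank lines above exceed the limit here
        pass               # the fix is simply to delete the extra blank line
```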

@HyukjinKwon (Member)

Hm, weird. It was a clean backport. Let me make a fix in master through branch-3.0 to reduce the diff. It seems legitimate anyway.

@dongjoon-hyun (Member)

Thanks!

@HyukjinKwon (Member)

Ah, it's branch-3.0 only. Let me just hotfix in branch-3.0 only.

@dongjoon-hyun (Member) commented May 1, 2020

Thank you. The follow-up looks good. BTW, FYI, the branch-3.0 unit tests have been broken by another commit.

@HyukjinKwon (Member)

Thanks for letting me know. I will take a look too.

@HyukjinKwon (Member)

Ah, it was already commented at #28194 :-)
