
Conversation

@WeichenXu123 (Contributor) commented Apr 28, 2020

What changes were proposed in this pull request?

I added a new API to the PySpark `RDD` class:

```python
def collectWithJobGroup(self, groupId, description, interruptOnCancel=False)
```

This API does the same thing as `rdd.collect`, but it lets the caller specify the job group for the collect job.
The purpose of adding this API is that, if we use:

```python
sc.setJobGroup("group-id...")
rdd.collect()
```

the `setJobGroup` API in PySpark won't work correctly. This is related to a bug discussed in
https://issues.apache.org/jira/browse/SPARK-31549
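To make the failure mode concrete, here is a minimal sketch (assuming a local SparkContext; the group ids are made up) of the multi-threaded pattern where the job group can end up mismatched with the Python thread that set it:

```python
import threading
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100))

def worker(group_id):
    # setJobGroup stores the group in a JVM thread-local. Without
    # PYSPARK_PIN_THREAD, py4j may service this call and the collect()
    # below on different JVM threads, so the group may be dropped or
    # attached to another thread's job.
    sc.setJobGroup(group_id, "demo job group")
    rdd.collect()

threads = [threading.Thread(target=worker, args=("group-%d" % i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```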

Note:

This PR is a rather temporary workaround for `PYSPARK_PIN_THREAD`, and a step toward migrating to `PYSPARK_PIN_THREAD` smoothly. It targets Spark 3.0.

  • `PYSPARK_PIN_THREAD` is unstable at the moment, and it affects whole PySpark applications.
  • It is impossible to make it a runtime configuration, as it has to be set before the JVM is launched.
  • There is a thread leak issue between Python and the JVM. We should address it, but it's not a release blocker for Spark 3.0 since the feature is experimental. I plan to handle it after Spark 3.0 for stability reasons.

Once `PYSPARK_PIN_THREAD` is enabled by default, we should ideally remove this API. I plan to deprecate it in Spark 3.1.

Why are the changes needed?

It fixes the `setJobGroup` bug described above.

Does this PR introduce any user-facing change?

Yes, a developer API in PySpark: `pyspark.RDD.collectWithJobGroup`.
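A hedged usage sketch (the variable names and group id are illustrative; only `collectWithJobGroup` itself comes from this PR):

```python
# Collect with an explicit job group instead of relying on
# sc.setJobGroup's thread-local state.
result = rdd.collectWithJobGroup("my-group", "collect under my-group",
                                 interruptOnCancel=True)
```

Another thread can then cancel the running job via the existing `sc.cancelJobGroup("my-group")`; with `interruptOnCancel=True`, cancellation also interrupts the executor threads running the job.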

How was this patch tested?

Unit test.

@WeichenXu123 force-pushed the collect_with_job_group branch from 822448c to 1ead01d on April 28, 2020 13:31
@SparkQA commented Apr 28, 2020

Test build #121999 has finished for PR 28395 at commit 1ead01d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr requested a review from HyukjinKwon on April 28, 2020 19:59
@HyukjinKwon (Member) left a comment

LGTM as a rather temporary workaround for `PYSPARK_PIN_THREAD`, and a step toward migrating to `PYSPARK_PIN_THREAD` smoothly. It targets Spark 3.0.

  • `PYSPARK_PIN_THREAD` is unstable at the moment, and it affects whole PySpark applications.
  • It is impossible to make it a runtime configuration, as it has to be set before the JVM is launched.
  • There is a thread leak issue between Python and the JVM. We should address it, but it's not a release blocker for Spark 3.0 since the feature is experimental. I plan to handle it after Spark 3.0 for stability reasons.

Once `PYSPARK_PIN_THREAD` is enabled by default, we should ideally remove this API. I plan to deprecate it in Spark 3.1.

@SparkQA commented Apr 29, 2020

Test build #122023 has finished for PR 28395 at commit 481bba6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Apr 29, 2020

Test build #122034 has finished for PR 28395 at commit 481bba6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Apr 29, 2020

Test build #122047 has finished for PR 28395 at commit 481bba6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Apr 29, 2020

Test build #122059 has finished for PR 28395 at commit 481bba6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

@WeichenXu123 the test failure seems legitimate.

@SparkQA commented Apr 30, 2020

Test build #122095 has finished for PR 28395 at commit 91cf1c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Apr 30, 2020

Test build #122100 has finished for PR 28395 at commit 91cf1c6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 30, 2020

Test build #122120 has finished for PR 28395 at commit bdd77fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master and branch-3.0. Thanks @mengxr, @WeichenXu123 and @dongjoon-hyun.

HyukjinKwon pushed a commit that referenced this pull request on May 1, 2020
…DD with user-specified job group

Closes #28395 from WeichenXu123/collect_with_job_group.

Authored-by: Weichen Xu <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit ee1de66)
Signed-off-by: HyukjinKwon <[email protected]>
@dongjoon-hyun (Member) commented May 1, 2020

Hi, @HyukjinKwon .
Could you fix the Python linter error on branch-3.0?

./python/pyspark/tests/test_rdd.py:787:5: E303 too many blank lines (2)
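For reference, E303 is pycodestyle's check for runs of consecutive blank lines above the configured maximum; a hypothetical illustration (not the actual contents of test_rdd.py):

```python
class RDDTests:            # hypothetical stand-in for the real test class
    def test_a(self):
        pass


    def test_b(self):      # E303: the two blank lines above exceed the limit here
        pass               # the fix is simply to delete the extra blank line
```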

@HyukjinKwon (Member)

Hm, weird. It was a clean backport. Let me make a fix in master through branch-3.0 to reduce the diff. It seems legitimate anyway.

@dongjoon-hyun (Member)

Thanks!

@HyukjinKwon (Member)

Ah, it's branch-3.0 only. Let me just hotfix in branch-3.0 only.

@dongjoon-hyun (Member) commented May 1, 2020

Thank you. The follow-up looks good. BTW, FYI, the branch-3.0 unit tests have been broken by another commit.

@HyukjinKwon (Member)

Thanks for letting me know. I will take a look too.

@HyukjinKwon (Member)

Ah, it was already commented at #28194 :-)
