[SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway #18339
Conversation
This is interesting, I've got a similar approach I've been working on in #17298, which has some issues inside of PyPI. Would that suit your needs if I extended it to allow you to enable it manually in addition to when the pipe was overloaded? Let me know. In the meantime, Jenkins ok to test.

Oh neat. #17298 looks similar to the approach we took in spylon-kernel to launch with stdout/stderr pipes redirected to the parent process and threads to read them (https://github.com/maxpoint/spylon-kernel/blob/master/spylon_kernel/scala_interpreter.py#L73). That project is based on Calysto/metakernel, which has an API for sending stdout/stderr back to kernel clients, so we use that instead. I still think it would be handy to give clients more control over how the py4j gateway is launched. For instance, if I want to use pyspark in an asyncio application, I might want to open pipes to the JVM process, but then switch them to non-blocking IO mode and hook them up to an async reader. If #17298 merges without making the threads optional and exposing the pipes for the caller to use, it's likely to be more harmful than helpful in the async situation.
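For reference, a minimal sketch of the pattern described above (the child's output redirected into pipes owned by the parent process, with a thread reading them); the command and the print-based forwarding are placeholders for illustration, not the spylon-kernel or pyspark code:

```python
import subprocess
from threading import Thread

def forward_output(pipe, emit):
    # Read the child's merged stdout/stderr line by line and hand each line
    # to whatever output channel the host application provides.
    for line in iter(pipe.readline, ""):
        emit(line.rstrip())
    pipe.close()

# Placeholder command; in the scenarios above this would be the JVM launch.
proc = subprocess.Popen(
    ["java", "-version"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    universal_newlines=True,  # text mode so readline() returns str
)
reader = Thread(target=forward_output, args=(proc.stdout, print))
reader.start()
proc.wait()
reader.join()
```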
The approach taken in https://github.com/maxpoint/spylon-kernel/blob/master/spylon_kernel/scala_interpreter.py#L73 is interesting (and definitely not supported), so making it easier for kernels to get at the JVM logs as needed seems worthwhile. That being said, if the messages are piped through from the JVM to the existing stderr/stdout pipes, would that be sufficient?

Jenkins ok to test.
python/pyspark/java_gateway.py
Outdated
I'd make this _popen_kwargs to indicate its usage is possibly not super supported.
Would a comment in the docstring to that effect be better? I haven't seen _var_name used in Python projects to indicate a developer feature. (But of course, maybe I've just not seen it yet!)
python/pyspark/java_gateway.py
Outdated
Mention that this is a developer feature and may change in future versions.
And ... you already noted what I just commented above. Doh! I'll update the docstring at least.
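A sketch of the kind of docstring note discussed in this review thread; the wording and signature here are illustrative only, not the final patch:

```python
def launch_gateway(conf=None, popen_kwargs=None):
    """
    Launch a py4j gateway backed by a JVM subprocess.

    :param conf: spark configuration passed to spark-submit
    :param popen_kwargs: dictionary of kwargs to pass to subprocess.Popen
        when spawning the JVM. This is a developer feature intended for
        advanced use cases and may change in future versions.
    """
```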
Test build #81472 has finished for PR 18339 at commit

Let's get some extra eyes on this, maybe @davies or @HyukjinKwon want to take a quick look? I think it makes sense as an advanced developer API but I'm open to other ideas.

Test build #81570 has finished for PR 18339 at commit
Thanks for cc'ing me. I think I can follow the discussion and the motivation here, but I am neutral (rather -0) as

Jenkins OK to test.

I am okay with going ahead @holdenk if you think it's okay anyway.

retest this please

Test build #83998 has finished for PR 18339 at commit

Let's see what @BryanCutler thinks

ok to test

Test build #91604 has finished for PR 18339 at commit
@HyukjinKwon what re-triggered your interest in this PR?

Jenkins left a comment asking "Can one of the admins verify this patch?" again. I was thinking it was worth it given your comment above, so I just triggered the build again. I am not sure why, when, or for whom Jenkins leaves those comments on some particular PRs. I was thinking about asking on the dev mailing list if it happens one more time.

Since @HyukjinKwon's concerns for this PR have been addressed, if @parente can update this to master it would be lovely to get this in for 3+, since I'm working on some multi-language pipeline stuff which could benefit.

@holdenk Took a note to look at it this weekend.
Allow the caller to customize the py4j JVM subprocess pipes and buffers for programmatic capturing of its output.
Force-pushed 3ece21f to fa63ba7
Test build #98174 has finished for PR 18339 at commit

Test build #98175 has finished for PR 18339 at commit

@holdenk I rebased the PR and I think it's good to go if you'd like to give it another look.

Small bump in case this is still of interest for 3.x.
The longer this PR has been open, the more times I've seen the need for it; my bad on not coming back to this. Jenkins retest this please.

For clarification, I am okay. No objection.

Jenkins retest this please

@parente if you could merge in master that would trigger a Jenkins run.

Looks like Jenkins listened; everything passed, so will merge to master.

Test build #102407 has finished for PR 18339 at commit

Merged to master
Closes apache#18339 from parente/feature/SPARK-21094-popen-args.
Lead-authored-by: Peter Parente
Co-authored-by: Peter Parente
Signed-off-by: Holden Karau
What changes were proposed in this pull request?
Allow the caller to customize the py4j JVM subprocess pipes and buffers for programmatic capturing of its output.
https://issues.apache.org/jira/browse/SPARK-21094 has more detail about the use case.
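As an illustration of the intended use, a rough sketch of capturing the gateway JVM's output through the new parameter; the specific Popen keywords and the gateway.proc attribute used below are assumptions for this example, not a definitive description of the patch:

```python
import subprocess
from threading import Thread

from pyspark.java_gateway import launch_gateway

# Redirect the JVM's stdout/stderr into pipes owned by this process instead
# of letting the JVM inherit the parent's streams.
gateway = launch_gateway(popen_kwargs={
    "stdout": subprocess.PIPE,
    "stderr": subprocess.STDOUT,
    "universal_newlines": True,
})

def drain(pipe):
    # Forward each line of JVM output wherever the application wants it.
    for line in iter(pipe.readline, ""):
        print("[jvm]", line.rstrip())

# Assumes the returned gateway keeps a handle to its Popen object.
Thread(target=drain, args=(gateway.proc.stdout,), daemon=True).start()
```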
How was this patch tested?
Tested by running the pyspark unit tests locally.