Skip to content

Conversation

@parente
Copy link
Contributor

@parente parente commented Jun 17, 2017

What changes were proposed in this pull request?

Allow the caller to customize the py4j JVM subprocess pipes and buffers for programmatic capturing of its output.

https://issues.apache.org/jira/browse/SPARK-21094 has more detail about the use case.

How was this patch tested?

Tested by running the pyspark unit tests locally.

@holdenk
Copy link
Contributor

holdenk commented Jul 2, 2017

This is interesting, I've got a similar approach I've been working on in #17298 which has some issues inside of PyPI. Would that suit your needs if I extended it to allow you to enable it manually in addition to when the pipe was overloaded?

Let me know.

In the meantime, jenkins ok to test.

@parente
Copy link
Contributor Author

parente commented Jul 3, 2017

Oh neat. #17298 looks similar to the approach we took in spylon-kernel to launch with stdout/stderr pipes redirected to the parent process and threads to read them (https://github.com/maxpoint/spylon-kernel/blob/master/spylon_kernel/scala_interpreter.py#L73). That project is based on Calysto/metakernel, which has an API for sending stdout/stderr back to kernel clients, so we use that instead of print() like the PR here does.

I still think it would be handy to give clients more control over how the py4j gateway is launched. For instance, if I want to use pyspark in an asyncio application, I might want to open pipes to the jvm process, but then switch them to non-blocking IO mode and hook them up to an async reader. If #17298 merges without a making the threads optional and exposing the pipes for the caller to use, it's likely to be more harmful than helpful in the async situation.

@holdenk
Copy link
Contributor

holdenk commented Jul 6, 2017

The approach taken in https://github.com/maxpoint/spylon-kernel/blob/master/spylon_kernel/scala_interpreter.py#L73 is interesting (and definitely not supported) - so making it easier for kernels to get at the JVM logs as needed seems worthwhile. That being said if the messages are piped through from the JVM to the existing stderr/stdout pipes would that be sufficient?

@holdenk
Copy link
Contributor

holdenk commented Sep 6, 2017

Jenkins ok to test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make this _popen_kwargs to indicate it's usage is possibly not super supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a comment in the docstring to that effect be better? I haven't seen _var_name used in Python projects to indicate a developer feature. (But of course, maybe I've just not seen it yet!)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention that this is a developer feature and may change in future versions.

Copy link
Contributor Author

@parente parente Sep 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And ... you already noted what I just commented above. Doh! I'll update the docstring at least.

@SparkQA
Copy link

SparkQA commented Sep 6, 2017

Test build #81472 has finished for PR 18339 at commit e9b7743.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Sep 8, 2017

Let's get some extra eyes on this, maybe @davies or @HyukjinKwon want to take a quick look? I think it makes sense as an advanced developer API but I'm open to other ideas.

@SparkQA
Copy link

SparkQA commented Sep 9, 2017

Test build #81570 has finished for PR 18339 at commit 3ece21f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Thanks for cc'ing me. To me, I think I can follow the discussion and the motivation here but I think I am neutral (rather -0) as launch_gateway itself looks undocumented and the workaround appears working fine. Will help review this one if any committer strongly prefers this anyway.

@holdenk
Copy link
Contributor

holdenk commented Nov 18, 2017

Jenkins OK to test.

@HyukjinKwon
Copy link
Member

I am okay with going ahead @holdenk if you think it's okay anyway.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Nov 19, 2017

Test build #83998 has finished for PR 18339 at commit 3ece21f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Feb 28, 2018

Lets see what @BryanCutler thinks

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jun 9, 2018

Test build #91604 has finished for PR 18339 at commit 3ece21f.

  • This patch fails SparkR unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Jun 28, 2018

@HyukjinKwon what re-triggered your interest in this PR?

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jun 29, 2018

Jenkins left a comment asking like "Can one of the admins verify this patch?" again. I was thinking it's worth given your comment above so I just triggered the build again ..

I am not sure why / when / who about Jenkins leaving those comments again to some particular PRs. I was thinking about asking this into dev mailing list if happens one more time.

@holdenk
Copy link
Contributor

holdenk commented Oct 26, 2018

Since @HyukjinKwon's concerns for this PR have been addressed if @parente can update this to master would be lovely to get this in for 3+ since I'm working on some multi-language pipeline stuff which could benefit.

@parente
Copy link
Contributor Author

parente commented Oct 26, 2018

@holdenk Took a note to look at it this weekend.

parente and others added 3 commits October 28, 2018 21:29
Allow the caller to customize the py4j JVM subprocess
pipes and buffers for programmatic capturing of its
output.
@parente parente force-pushed the feature/SPARK-21094-popen-args branch from 3ece21f to fa63ba7 Compare October 29, 2018 01:42
@SparkQA
Copy link

SparkQA commented Oct 29, 2018

Test build #98174 has finished for PR 18339 at commit fa63ba7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 29, 2018

Test build #98175 has finished for PR 18339 at commit ea267c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@parente
Copy link
Contributor Author

parente commented Oct 30, 2018

@holdenk I rebased the PR and I think it's good to go if you'd like to give it another look.

@parente
Copy link
Contributor Author

parente commented Feb 9, 2019

Small bump in case this is still of interest for 3.x.

@holdenk
Copy link
Contributor

holdenk commented Feb 11, 2019

The longer this PR has been open the more times I've seen the need for it, my bad on not coming back to this. Jenkins retest this please.
Looks good to me pending jenkins.

@HyukjinKwon
Copy link
Member

For clarification, I am okay. no objection.

@holdenk
Copy link
Contributor

holdenk commented Feb 16, 2019

Jenkins retest this please

@holdenk
Copy link
Contributor

holdenk commented Feb 16, 2019

@parente if you could merge in master that would trigger a Jenkins run.

@holdenk
Copy link
Contributor

holdenk commented Feb 16, 2019

Looks like Jenkins listened, everything passed so will merge to master.

@SparkQA
Copy link

SparkQA commented Feb 16, 2019

Test build #102407 has finished for PR 18339 at commit ea267c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Feb 16, 2019

Merged to master

@asfgit asfgit closed this in 3d6066e Feb 16, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Allow the caller to customize the py4j JVM subprocess pipes and buffers for programmatic capturing of its output.

https://issues.apache.org/jira/browse/SPARK-21094 has more detail about the use case.

## How was this patch tested?

Tested by running the pyspark unit tests locally.

Closes apache#18339 from parente/feature/SPARK-21094-popen-args.

Lead-authored-by: Peter Parente <[email protected]>
Co-authored-by: Peter Parente <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants