[SPARK-2242] HOTFIX: pyspark shell hangs on simple job #1178
Conversation
Merged build triggered.

Merged build started.
If the gateway process fails to start correctly (e.g., because JAVA_HOME isn't set correctly, there's no Spark jar, etc.), pyspark currently fails with a very difficult-to-understand error: we try to parse stdout to get the port where Spark started, and there's nothing there. This commit properly catches the error and throws an exception that includes the stderr output, for much easier debugging. Thanks to @shivaram and @stogers for helping to fix this issue!

Author: Kay Ousterhout <[email protected]>

Closes #383 from kayousterhout/pyspark and squashes the following commits:

36dd54b [Kay Ousterhout] [SPARK-1466] Raise exception if Gateway process doesn't start.
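The error handling described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual Spark source; the function name `launch_gateway` and the exception message are assumptions. The idea is that when the first line of the child's stdout cannot be parsed as a port number, the child's stderr is surfaced in the exception instead of a bare `ValueError`:

```python
import subprocess

def launch_gateway(command):
    # Hypothetical sketch: start the gateway process and read the port
    # it announces on the first line of stdout.
    proc = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    first_line = proc.stdout.readline()
    try:
        return int(first_line)
    except ValueError:
        # The port line was missing or garbled; include the child's
        # stderr so the user can see what actually went wrong.
        stderr = proc.stderr.read().decode()
        raise Exception(
            "Launching GatewayServer failed with stderr:\n" + stderr)
```

A well-behaved child returns its port; a misbehaving one produces an exception carrying its stderr rather than an opaque parse error.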
Merged build triggered.

Merged build started.

Merged build finished. All automated tests passed.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16016/
Jenkins test this please

Merged build triggered.

Merged build started.

Merged build finished. All automated tests passed.
Sorry about this and thanks for fixing it @andrewor14! Just one change: I think you should also delete the `proc.stderr.readlines()` call on line 57, since that won't return anything now.
BTW did you look into using `communicate()` at all? If not I'll look into that to do a long-term fix later today.
Hm no I have not, though I'm not sure if we can use `communicate()` here.
Unless we do a
Yeah, the latter thing is what I was thinking.
Before: `ValueError: invalid literal for int() with base 10`
After: `Launching GatewayServer failed because of stdout interference. Silence the following and try again.`
Merged build triggered.

Merged build started.
Looks like if an exception is thrown because of casting the output to int (rather than reading the output of the process itself), then the process exit code returned is misleading. I haven't invested a ton of time on figuring out how to use `communicate()` here.
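For reference, a minimal sketch of what using `communicate()` would look like, per the docs discussed in this thread. This is an illustration, not the Spark code; the function name `run_and_capture` is made up, and the caveat in the comment is the assumed reason it may not fit this use case:

```python
import subprocess

def run_and_capture(command):
    # communicate() drains stdout and stderr concurrently, so it cannot
    # deadlock on pipe-buffer volume the way sequential reads can.
    # Caveat (assumption): it waits for the child to exit, so it only
    # fits commands that terminate, not a long-lived gateway process.
    proc = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err
```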
Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16040/
Merged build triggered.

Merged build started.
@mattf The second issue that your changes don't address is that the existing code also masks stderr output, which contains important Spark logging information. Also, I tried your patch out, and
Here's the output of the latest commit in the event of stdout interference:
This looks perfect!
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16064/ |
Jenkins, test this please

Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16065/

Jenkins, test this please

Merged build triggered.

Merged build started.

Merged build finished. All automated tests passed.
@andrewor14 what's the reproducer for the "hangs when an exception is thrown" case?
@mattf try adding the following lines to `spark-class`. What pyspark tries to do is read the string "Hello. This goes to stdout..." as an int, and it throws an exception. I think whether it hangs depends on the environment, but on mine I ran into the deadlock the python docs warn against.
@andrewor14 thanks, I've been able to reproduce a hang when spark-class outputs something other than the port #
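The reproduction described above can be approximated without Spark at all. This hypothetical stand-in for `spark-class` writes a stray line to stdout before the port number, so the first line the parent reads is not an int, which is exactly the parse that pyspark trips over:

```python
import subprocess
import sys

# Hypothetical stand-in for spark-class: a child that prints a stray
# line to stdout before the py4j port number.
child = subprocess.Popen(
    [sys.executable, "-c",
     "print('Hello. This goes to stdout'); print(12345)"],
    stdout=subprocess.PIPE)
first_line = child.stdout.readline()
try:
    port = int(first_line)
except ValueError:
    # This is the path the old pyspark code died (or hung) on.
    port = None
child.wait()
```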
This looks good to me. I'm going to merge it since pyspark is broken without this patch.
This reverts a change introduced in 3870248, which redirected all stderr to the OS pipe instead of directly to the `bin/pyspark` shell output. This causes a simple job to hang in two ways:

1. If the cluster is not configured correctly or does not have enough resources, the job hangs without producing any output, because the relevant warning messages are masked.
2. If the stderr volume is large, this could lead to a deadlock if we redirect everything to the OS pipe. From the [python docs](https://docs.python.org/2/library/subprocess.html):

```
Note Do not use stdout=PIPE or stderr=PIPE with this function as that can deadlock based on the child process output volume. Use Popen with the communicate() method when you need pipes.
```

Note that we cannot remove `stdout=PIPE` in a similar way, because we currently use it to communicate the py4j port. However, it should be fine (as it has been for a long time) because we do not produce a ton of traffic through `stdout`.

That commit was not merged in branch-1.0, so this fix is for master only.

Author: Andrew Or <[email protected]>

Closes apache#1178 from andrewor14/fix-python and squashes the following commits:

e68e870 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-python
20849a8 [Andrew Or] Tone down stdout interference message
a09805b [Andrew Or] Return more than 1 line of error message to user
6dfbd1e [Andrew Or] Don't swallow original exception
0d1861f [Andrew Or] Provide more helpful output if stdout is garbled
21c9d7c [Andrew Or] Do not mask stderr from output
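The shape of the fix can be sketched as follows. This is a simplified illustration under stated assumptions (the function name is made up, and the real code does more): stdout stays piped solely so the parent can read the py4j port, while stderr is left as `None` so it is inherited from the parent process:

```python
import subprocess

def launch_without_masking_stderr(command):
    # stdout is piped only so we can read the py4j port from the first
    # line. stderr is deliberately left as None: the child's logging
    # goes straight to the user's terminal, so warnings are not masked
    # and a full OS pipe buffer cannot deadlock the child.
    proc = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=None)
    port = int(proc.stdout.readline())
    return proc, port
```

Since stdout only ever carries the one short port line, keeping `stdout=PIPE` does not run into the volume-based deadlock the python docs warn about.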