Conversation

@kayousterhout
Contributor

If the gateway process fails to start correctly (e.g., because JAVA_HOME isn't set correctly, there's no Spark jar, etc.), right now pyspark fails because of a very difficult-to-understand error, where we try to parse stdout to get the port where Spark started and there's nothing there. This commit properly catches the error and throws an exception that includes the stderr output for much easier debugging.

Thanks to @shivaram and @stogers for helping to fix this issue!
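The failure mode described above can be sketched as follows. This is a minimal illustration of the idea, not the actual pyspark `java_gateway` code; the function name and messages are hypothetical:

```python
import subprocess

def launch_gateway(command):
    # Minimal sketch of the fix: launch the gateway process and read the
    # port it prints on stdout. If the process dies before printing a
    # port, raise an exception carrying its stderr output instead of
    # failing with a confusing int-parsing error on empty stdout.
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, text=True)
    line = proc.stdout.readline()
    if not line.strip():
        proc.wait()
        raise RuntimeError("Gateway process exited before sending its "
                           "port number:\n" + proc.stderr.read())
    return int(line.strip())
```

With this shape, a misconfigured `JAVA_HOME` surfaces directly in the exception message rather than as a `ValueError` from `int()`.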

@kayousterhout
Contributor Author

This should be backported to 0.9 and 1.0

@kayousterhout
Contributor Author

BTW this is a much bigger issue with IPython Notebook -- if you're running in the console, you get the wrong error (from parsing the int) but also the correct error. If you're running in IPython Notebook, you only get the wrong error, which makes this very annoying to debug.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14017/

@pwendell
Contributor

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14069/

Contributor

Maybe this should say "Launching GatewayServer failed"? It will be more informative, otherwise people will think something is wrong with SparkContext itself.

@mateiz
Contributor

mateiz commented Apr 18, 2014

@kayousterhout not sure if you saw my comment, this looks good but the exception message is somewhat confusing. It would be good to update that.

@pwendell
Contributor

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@kayousterhout
Contributor Author

I did but haven't had time to figure out why the tests are failing (the tests don't run properly on my laptop). Hoping this was a Jenkins issue and the re-launched tests pass.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14230/

Also include stderr output to help users debug startup issues.
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15585/

@mateiz
Contributor

mateiz commented Jun 10, 2014

It looks like the tests magically passed now! Is this good to go?

@kayousterhout
Contributor Author

Yup! I just rebased so it should merge cleanly on master.


@asfgit asfgit closed this in 3870248 Jun 18, 2014
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
If the gateway process fails to start correctly (e.g., because JAVA_HOME isn't set correctly, there's no Spark jar, etc.), right now pyspark fails because of a very difficult-to-understand error, where we try to parse stdout to get the port where Spark started and there's nothing there. This commit properly catches the error and throws an exception that includes the stderr output for much easier debugging.

Thanks to @shivaram and @stogers for helping to fix this issue!

Author: Kay Ousterhout <[email protected]>

Closes apache#383 from kayousterhout/pyspark and squashes the following commits:

36dd54b [Kay Ousterhout] [SPARK-1466] Raise exception if Gateway process doesn't start.
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
If the gateway process fails to start correctly (e.g., because JAVA_HOME isn't set correctly, there's no Spark jar, etc.), right now pyspark fails because of a very difficult-to-understand error, where we try to parse stdout to get the port where Spark started and there's nothing there. This commit properly catches the error and throws an exception that includes the stderr output for much easier debugging.

Thanks to @shivaram and @stogers for helping to fix this issue!

Author: Kay Ousterhout <[email protected]>

Closes apache#383 from kayousterhout/pyspark and squashes the following commits:

36dd54b [Kay Ousterhout] [SPARK-1466] Raise exception if Gateway process doesn't start.
tangzhankun pushed a commit to tangzhankun/spark that referenced this pull request Jul 25, 2017
…he#383)

This makes executors consistent with the driver. Note that
SPARK_EXTRA_CLASSPATH isn't set anywhere by Spark itself, but it's
primarily meant to be set by images that inherit from the base
driver/executor images.
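As a hedged illustration of what such an inheriting image might look like (the registry, image name, and jar path here are hypothetical; only `SPARK_EXTRA_CLASSPATH` is taken from the commit message):

```dockerfile
# Hypothetical derived image. Spark itself never sets
# SPARK_EXTRA_CLASSPATH; images that inherit from the base
# driver/executor images are expected to set it themselves.
FROM my-registry/spark-executor:latest
COPY extra-jars/ /opt/extra-jars/
ENV SPARK_EXTRA_CLASSPATH=/opt/extra-jars/*
```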
erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017
…he#383)

This makes executors consistent with the driver. Note that
SPARK_EXTRA_CLASSPATH isn't set anywhere by Spark itself, but it's
primarily meant to be set by images that inherit from the base
driver/executor images.
mccheah added a commit to mccheah/spark that referenced this pull request Nov 28, 2018
Apply patches for SPARK-24531 to fix tests
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Diable S3 test cases in fusioncloud job
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…leanup (apache#383)

([Original PR](apache#46027))

### What changes were proposed in this pull request?

Expired sessions are regularly checked and cleaned up by a maintenance thread. Currently, however, this cleanup is synchronous, so in rare cases interrupting the execution thread of a query in a session can take hours, stalling the entire maintenance process and leaving a large amount of memory uncleared.

We address this by introducing asynchronous callbacks for execution cleanup, avoiding synchronous joins of execution threads, and preventing the maintenance thread from stalling in the above scenarios. To be more specific, instead of calling `runner.join()` in `ExecutorHolder.close()`, we set a post-cleanup function as the callback through `runner.processOnCompletion`, which will be called asynchronously once the execution runner is completed or interrupted. In this way, the maintenance thread won't get blocked on joining an execution thread.
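The callback-based cleanup described above can be sketched in Python. This is an illustration of the pattern only, not the actual Scala implementation; the class and method names loosely mirror `ExecutorHolder.close()` and `runner.processOnCompletion` from the description:

```python
import threading

class ExecuteRunner:
    """Sketch: instead of the maintenance thread calling join() on the
    execution thread, it registers a completion callback that runs
    asynchronously once the runner finishes or is interrupted."""

    def __init__(self, work):
        self._work = work
        self._on_completion = None
        self._done = False
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run)

    def start(self):
        self._thread.start()

    def _run(self):
        try:
            self._work()
        finally:
            # Atomically mark completion and grab any registered callback.
            with self._lock:
                self._done = True
                callback = self._on_completion
            if callback is not None:
                callback()  # runs on the execution thread, not maintenance

    def process_on_completion(self, callback):
        """Register post-cleanup work; invoke immediately if already done.
        Never blocks, so a maintenance thread calling this cannot stall."""
        with self._lock:
            if self._done:
                run_now = True
            else:
                self._on_completion = callback
                run_now = False
        if run_now:
            callback()
```

The lock ensures the callback fires exactly once whether it is registered before or after the runner completes, which is the property that lets the maintenance thread avoid a blocking `join()`.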

### Why are the changes needed?

In the rare cases mentioned above, performance can be severely affected.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests and a new test `Async cleanup callback gets called after the execution is closed` in `ReattachableExecuteSuite.scala`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46064 from xi-db/SPARK-47819-async-cleanup-3.5.

Authored-by: Xi Lyu <[email protected]>

Signed-off-by: Herman van Hovell <[email protected]>
Co-authored-by: Xi Lyu <[email protected]>