
Conversation


@jinxing64 jinxing64 commented May 24, 2018

What changes were proposed in this pull request?

After #20014, Spark no longer fails the entire executor but only fails the task that suffers a SparkOutOfMemoryError. After #21342, BroadcastExchangeExec try-catches OutOfMemoryError. Consider the scenario below:

  1. SparkOutOfMemoryError (a subclass of OutOfMemoryError) is thrown inside scala.concurrent.Future;
  2. The SparkOutOfMemoryError is caught, and an OutOfMemoryError is wrapped in SparkFatalException and re-thrown;
  3. ThreadUtils.awaitResult catches the SparkFatalException, and an OutOfMemoryError is thrown;
  4. The OutOfMemoryError goes to SparkUncaughtExceptionHandler.uncaughtException and the executor fails.

So it makes more sense to catch SparkOutOfMemoryError and re-throw a SparkFatalException that wraps the SparkOutOfMemoryError.
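
Below is a minimal, self-contained sketch of this propagation path. The two exception classes are simplified stand-ins for Spark's internal SparkOutOfMemoryError and SparkFatalException (the real ones are not public API), and buildRelation is a hypothetical helper, so this is an illustration rather than the actual BroadcastExchangeExec code:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Simplified stand-ins for Spark's internal exception classes.
class SparkOutOfMemoryError(msg: String) extends OutOfMemoryError(msg)
class SparkFatalException(val throwable: Throwable) extends Exception(throwable)

object BroadcastOomPropagation {
  // Hypothetical stand-in for building the hash relation; fails with the Spark-managed OOM (step 1).
  def buildRelation(): Long =
    throw new SparkOutOfMemoryError("unable to acquire memory for the hash relation")

  def main(args: Array[String]): Unit = {
    implicit val ec: ExecutionContext = ExecutionContext.global

    // Step 2: catch inside the Future and wrap the error in SparkFatalException so the
    // thread pool's uncaught-exception handler is not triggered. The proposal is to wrap
    // a SparkOutOfMemoryError rather than a plain OutOfMemoryError as the cause.
    val relationFuture: Future[Long] = Future {
      try buildRelation()
      catch {
        case _: SparkOutOfMemoryError =>
          throw new SparkFatalException(
            new SparkOutOfMemoryError("Not enough memory to build and broadcast the table"))
      }
    }

    // Steps 3-4: the caller (ThreadUtils.awaitResult in Spark) unwraps SparkFatalException
    // and re-throws the cause. A plain OutOfMemoryError cause reaches
    // SparkUncaughtExceptionHandler and takes the whole process down, while a
    // SparkOutOfMemoryError only fails the current task.
    try {
      Await.result(relationFuture, Duration.Inf)
    } catch {
      case e: SparkFatalException => throw e.throwable
    }
  }
}
```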

@jinxing64 (Author)

cc @cloud-fan @JoshRosen
Would you please take a look at this when you have time?

  case oe: SparkOutOfMemoryError =>
    throw new SparkFatalException(
-     new OutOfMemoryError(s"Not enough memory to build and broadcast the table to " +
+     new SparkOutOfMemoryError(s"Not enough memory to build and broadcast the table to " +
Contributor

Since we fully control the creation of SparkOutOfMemoryError, can we move the error message to where we throw SparkOutOfMemoryError when building the hash relation?

@SparkQA

SparkQA commented May 24, 2018

Test build #91105 has finished for PR 21424 at commit aa10470.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 27, 2018

Test build #91204 has finished for PR 21424 at commit 3a9669c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 27, 2018

Test build #91203 has finished for PR 21424 at commit 8f224fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    }
  } catch {
    case oe: SparkOutOfMemoryError =>
      throw new SparkOutOfMemoryError(s"If this SparkOutOfMemoryError happens in Spark driver," +
Contributor

Ah, I see. So the SparkOutOfMemoryError is thrown by BytesToBytesMap; we need to catch and re-throw it to attach the error message anyway.

I also found that we may throw an OOM when calling child.executeCollectIterator, which calls RDD#collect, so it seems the previous code is correct.

Author

So, should I change it back?

Contributor

It seems we don't need to change anything; maybe just add some comments to say where an OOM can occur, i.e. RDD#collect and BroadcastMode#transform.
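
For illustration, here is a small self-contained sketch of that suggestion, with comments marking the two places an OOM can surface. collectChildRows and buildRelation are hypothetical stand-ins for SparkPlan#executeCollectIterator (which calls RDD#collect) and BroadcastMode#transform; this is not the real BroadcastExchangeExec code:

```scala
object BroadcastOomSources {
  // Hypothetical stand-in for SparkPlan#executeCollectIterator / RDD#collect.
  def collectChildRows(): Array[Long] = Array.tabulate(4)(_.toLong)

  // Hypothetical stand-in for BroadcastMode#transform (the hash relation is backed by BytesToBytesMap).
  def buildRelation(rows: Array[Long]): Map[Long, Long] = rows.map(r => r -> r).toMap

  def main(args: Array[String]): Unit = {
    // An OOM can occur here: collecting the child plan's rows on the driver (RDD#collect).
    val rows = collectChildRows()

    // An OOM can also occur here: building the broadcast relation on the driver
    // (BroadcastMode#transform).
    val relation = buildRelation(rows)

    println(s"built relation with ${relation.size} entries")
  }
}
```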

@jinxing64 (Author) commented May 28, 2018

@cloud-fan

> I also found that we may throw an OOM

My previous understanding is that Spark throws SparkOutOfMemoryError when it expects memory to be exhausted -- such an expectation comes from Spark's own memory management. So it is safe to catch SparkOutOfMemoryError and mark the corresponding task as failed rather than failing the executor.

If an OutOfMemoryError is thrown from JVM internals, e.g. an OOM on new Array[bigSize], is it OK to catch it and keep the executor running as if nothing happened? If so, does that mean an executor should never exit on an OutOfMemoryError?

@cloud-fan (Contributor)

That's a good point!

According to the documentation of SparkOutOfMemoryError, Spark should not kill the executor when a SparkOutOfMemoryError is thrown. Broadcast is special because it runs on the driver side, so there a SparkOutOfMemoryError is the same as an OutOfMemoryError; we need to kill the driver anyway.

That said, I think it's fine to catch OutOfMemoryError and enhance the error message here, to give users some details about why the job failed.

One thing we can do for future safety: do not change the exception type when enhancing the error message.
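
A minimal, self-contained sketch of that advice follows. SparkOutOfMemoryError here is a simplified stand-in for Spark's internal class and buildBroadcastRelation is a hypothetical helper; the point is that the re-thrown error keeps the original type and only the message is enhanced:

```scala
object PreserveOomType {
  // Simplified stand-in for Spark's internal SparkOutOfMemoryError.
  class SparkOutOfMemoryError(msg: String) extends OutOfMemoryError(msg)

  // Hypothetical helper that may fail with either a Spark-managed or a plain JVM OOM.
  def buildBroadcastRelation(): Long =
    throw new SparkOutOfMemoryError("unable to acquire 64 bytes of memory")

  def main(args: Array[String]): Unit = {
    try {
      buildBroadcastRelation()
    } catch {
      // Match the more specific Spark-managed error first and re-throw the same type,
      // so handlers that treat SparkOutOfMemoryError as a task-level failure still work.
      case oe: SparkOutOfMemoryError =>
        throw new SparkOutOfMemoryError(
          s"Not enough memory to build and broadcast the table: ${oe.getMessage}")
      // Keep plain JVM OOMs typed as OutOfMemoryError so they still terminate the process.
      case oe: OutOfMemoryError =>
        throw new OutOfMemoryError(
          s"Not enough memory to build and broadcast the table: ${oe.getMessage}")
    }
  }
}
```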

@jinxing64 (Author)

> Broadcast is special because it runs on the driver side, so there a SparkOutOfMemoryError is the same as an OutOfMemoryError; we need to kill the driver anyway.

Thanks a lot for the detailed explanation. I will close this PR and leave the code unchanged :)

@jinxing64 jinxing64 closed this May 28, 2018
@HyukjinKwon (Member)

I left the JIRA resolved as "Won't fix" given the discussion above.

