Conversation

@iRakson
Contributor

iRakson commented Dec 16, 2019

What changes were proposed in this pull request?

Added a shutdown hook in executor.scala which ensures that the executor's stop() method is always called.
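
For illustration, a minimal, self-contained sketch of the idea. Spark ships an internal ShutdownHookManager utility for this; the placeholder class below uses the plain JVM API instead and is not the actual patch:

```scala
// Hedged sketch only: registering a JVM shutdown hook so that stop() runs
// even when the executor JVM exits non-gracefully. ExecutorSketch and its
// fields are placeholders, not the code in executor.scala.
class ExecutorSketch {
  @volatile private var stopped = false

  private val shutdownHook = new Thread("executor-shutdown-hook") {
    override def run(): Unit = stop()
  }
  Runtime.getRuntime.addShutdownHook(shutdownHook)

  def stop(): Unit = synchronized {
    if (!stopped) {
      stopped = true
      // The real code would shut down plugins, the heartbeater, thread pools, etc.
    }
  }
}
```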

Why are the changes needed?

When executors do not go down gracefully, their stop() method is not called.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually

@iRakson
Contributor Author

iRakson commented Dec 16, 2019

cc @dongjoon-hyun

@dongjoon-hyun
Member

ok to test

@dongjoon-hyun
Member

Thank you, @iRakson!

@SparkQA

SparkQA commented Dec 16, 2019

Test build #115373 has finished for PR 26901 at commit b528283.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please

@SparkQA

SparkQA commented Dec 16, 2019

Test build #115383 has finished for PR 26901 at commit b528283.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson
Contributor Author

iRakson commented Dec 16, 2019

I will check the issue.

@dongjoon-hyun
Member

Thank you, @iRakson!

@vanzin
Contributor

vanzin commented Dec 17, 2019

There are a bunch of OOM errors in the test output. They all seem to be on the driver side, which is not touched by this PR, so it's unclear to me how the change here could cause those failures.

I can't reproduce the OOM locally, but if someone can and is able to attach a debugger or get a heap dump, it could help.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Dec 19, 2019

Test build #115521 has finished for PR 26901 at commit b528283.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

According to the failure history, this PR always fails with Maven at the following test, and mostly the failures are not OOMs.

BarrierTaskContextSuite:
- global sync by barrier() call
- support multiple barrier() call within a single task *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(0, 1) finished unsuccessfully.

@vanzin
Contributor

vanzin commented Dec 19, 2019

That's what is reported on the UI. But if you look at the actual unit-tests.log file, there are a bunch of OOM errors. From the latest timed out run:

===== TEST OUTPUT FOR o.a.s.scheduler.BarrierTaskContextSuite: 'support multiple barrier() call within a single task' =====

...

19/12/18 10:08:53.307 rpc-server-4270-2 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=6150562811854402958, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]}} to /127.0.0.1:43378; closing connection
io.netty.handler.codec.EncoderException: java.lang.OutOfMemoryError: Java heap space
	at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:107)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:716)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:708)

@dongjoon-hyun
Member

Oh, got it, @vanzin.

@dongjoon-hyun
Member

Ping, @iRakson.

@iRakson
Contributor Author

iRakson commented Feb 6, 2020

Retest this please.

@dongjoon-hyun
Member

Retest this please

@SparkQA

SparkQA commented Feb 6, 2020

Test build #117970 has finished for PR 26901 at commit b528283.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Feb 7, 2020

Test build #118006 has finished for PR 26901 at commit b528283.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson
Contributor Author

iRakson commented Apr 9, 2020

@dongjoon-hyun @vanzin
Please take a look at this approach for solving the issue.
Earlier I was adding a shutdown hook in executor.scala which calls the executors' stop() method.

Instead of adding a shutdown hook, I am now sending a StopExecutor call to the executors before killing them. This way they will stop themselves on their own before being killed.

I think this approach is better than forcing a shutdown hook.
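
For illustration, a rough sketch of that ordering with simplified, hypothetical names; the real code path goes through Spark's driver-side scheduler backend and the executors' RPC endpoints, not these classes:

```scala
// Hedged sketch only: the driver asks each executor to stop itself (so it can
// run stop() and shut plugins down cleanly) before the actual kill is issued.
// DriverBackendSketch, ExecutorEndpointRef and requestKillFromClusterManager
// are placeholders, not Spark's real classes.
case object StopExecutor

trait ExecutorEndpointRef {
  def send(msg: Any): Unit // fire-and-forget RPC to the executor
}

class DriverBackendSketch(executors: Map[String, ExecutorEndpointRef]) {
  // Called when dynamic allocation decides to remove idle executors.
  def killExecutors(ids: Seq[String]): Unit = {
    ids.foreach { id =>
      // 1. Tell the executor to stop itself first.
      executors.get(id).foreach(_.send(StopExecutor))
      // 2. Then ask the cluster manager to actually kill the process.
      requestKillFromClusterManager(id)
    }
  }

  private def requestKillFromClusterManager(id: String): Unit = {
    // Placeholder for the real kill request.
  }
}
```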

@SparkQA

SparkQA commented Apr 9, 2020

Test build #121006 has finished for PR 26901 at commit c7e969a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson
Contributor Author

iRakson commented Apr 9, 2020

retest this please.

iRakson changed the title from "[SPARK-29152][2.4][test-maven]Executor Plugin shutdown when dynamic allocation is enabled" to "[SPARK-29152][2.4]Executor Plugin shutdown when dynamic allocation is enabled" on Apr 9, 2020
@dongjoon-hyun
Member

Retest this please

dongjoon-hyun changed the title from "[SPARK-29152][2.4]Executor Plugin shutdown when dynamic allocation is enabled" to "[SPARK-29152][CORE][2.4] Executor Plugin shutdown when dynamic allocation is enabled" on Apr 9, 2020
@SparkQA

SparkQA commented Apr 9, 2020

Test build #121038 has finished for PR 26901 at commit c7e969a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson
Contributor Author

iRakson commented Apr 13, 2020

gentle ping @dongjoon-hyun @vanzin

@dongjoon-hyun
Member

If this one is better, shall we update master and branch-3.0 first?
We had better be consistent across all active branches unless there is a reason not to be.

@iRakson
Contributor Author

iRakson commented Apr 18, 2020

If this one is better, shall we update master and branch-3.0 first?
We had better be consistent across all active branches unless there is a reason not to be.

Yes, actually that is what I was thinking as well. We should update it in master and 3.0 too.

I have one doubt, though: should I create a new JIRA and raise a PR for that? That way it will be easier to track the changes.

@dongjoon-hyun
Member

Since 3.0 is not released yet, you can create a follow-up PR like [SPARK-29152][CORE][FOLLOWUP] on master branch.

@iRakson
Contributor Author

iRakson commented Apr 18, 2020

Since 3.0 is not released yet, you can create a follow-up PR like [[SPARK-29152](https://issues.apache.org/jira/browse/SPARK-29152)][CORE][FOLLOWUP] on master branch.

@dongjoon-hyun
I did not see this comment before creating another JIRA. :( I am extremely sorry.
Please review this PR.

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123827 has finished for PR 26901 at commit 123f429.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson
Contributor Author

iRakson commented Jun 11, 2020

retest this please

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123832 has finished for PR 26901 at commit 123f429.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson
Contributor Author

iRakson commented Jun 11, 2020

retest this please

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123844 has finished for PR 26901 at commit 123f429.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson
Contributor Author

iRakson commented Jun 11, 2020

retest this please

@iRakson
Contributor Author

iRakson commented Jun 11, 2020

@dongjoon-hyun @vanzin I have tested this a couple of times; the older approach of adding a shutdown hook is working fine. The last failure is unrelated to this. Also, about the other approach, I think Vanzin's comment is a valid one: covering all the executor exit scenarios can be difficult. This seems the better fix for our problem.

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123847 has finished for PR 26901 at commit 123f429.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson
Contributor Author

iRakson commented Jun 12, 2020

cc @srowen

@iRakson
Contributor Author

iRakson commented Jun 14, 2020

@dongjoon-hyun @vanzin Can we get this into 2.4?
I think this is fine now; it should not break anything.
I have already tested it multiple times.

@dongjoon-hyun
Member

Hi, @iRakson. So, is this your final decision, and are you going to close #28254?
This PR looks like it has been reverted to the original patch. In that case, we don't need #28254, do we?

@iRakson
Contributor Author

iRakson commented Jun 14, 2020

Hi, @iRakson. So, is this your final decision, and are you going to close #28254?
This PR looks like it has been reverted to the original patch. In that case, we don't need #28254, do we?

Yeah, we don't need #28254.

@iRakson
Contributor Author

iRakson commented Jun 14, 2020

#28254 has been closed.

Member

dongjoon-hyun left a comment


+1, LGTM. Thank you, @iRakson and @vanzin.
Merged to branch-2.4.

dongjoon-hyun pushed a commit that referenced this pull request Jun 14, 2020
[SPARK-29152][CORE][2.4] Executor Plugin shutdown when dynamic allocation is enabled

### What changes were proposed in this pull request?
Added a Shutdown Hook in `executor.scala` which will ensure that executor's `stop()` method is always called.

### Why are the changes needed?
In case executors are not going down gracefully, their `stop()` is not called.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manually

Closes #26901 from iRakson/SPARK-29152_2.4.

Authored-by: iRakson <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Member

dongjoon-hyun commented Jun 15, 2020

@dongjoon-hyun
Member

dongjoon-hyun commented Jun 15, 2020

The situation is the same. There are OOM errors.

java.lang.OutOfMemoryError: OOM while handling another exception
	at org.apache.spark.executor.FetchFailureHidingRDD.compute(ExecutorSuite.scala:377)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:416)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:422)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
20/06/14 16:21:15.737 ScalaTest-main-running-ExecutorSuite INFO ExecutorSuite: 

@dongjoon-hyun
Member

Sorry again, @iRakson. I fully understand that you spent a lot of time on this contribution and that the failed test cases pass individually and locally. However, we cannot keep this patch in branch-2.4 because it will hide all the other failures.

I admit this is my bad, because I reviewed and merged the same PR twice and reverted it. With two tries, we have proved that the original patch definitely has some issue with branch-2.4. Instead of putting more effort into this, shall we move on to the new Apache Spark 3.0.0, since the 3.0.0 release is out now? Technically, the original bug has been present since Apache Spark 2.4.0.

For now, I'll revert this commit to recover branch-2.4. The hang situation looks worse than the other failures.

@iRakson
Contributor Author

iRakson commented Jun 15, 2020

@dongjoon-hyun Its behaviour is pretty confusing. But yeah, if this is breaking the branch again, then we should not keep it. Yes, this patch failed twice, so we must move on.

Thank you for actively monitoring this patch. :) :)

@attilapiros
Contributor

attilapiros commented Jun 3, 2021

I know why this caused OOM. Here is the reason and the fix: #32748
