Conversation

@mridulm (Contributor) commented Jul 6, 2015

No description provided.

@SparkQA commented Jul 6, 2015

Test build #36599 has finished for PR 7243 at commit a3a0f01.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm (Contributor, Author) commented Jul 6, 2015

Some stupid merge issues in the history of this PR - but hopefully it is fine now.

@mridulm (Contributor, Author) commented Jul 6, 2015

@pwendell Weird, the tests are still running but there is already a FAILED note?

@SparkQA commented Jul 7, 2015

Test build #36609 has finished for PR 7243 at commit 362d64a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm (Contributor, Author) commented Jul 7, 2015

jenkins test this please

@SparkQA commented Jul 7, 2015

Test build #36633 has finished for PR 7243 at commit 9218fcc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm (Contributor, Author) commented Jul 7, 2015

The failures (in YarnClusterSuite) are unrelated to this change - and happen with a clean checkout as well.

@mridulm (Contributor, Author) commented Jul 7, 2015

@tgravescs please take a look, thanks.

Contributor (inline review comment)

How does this work? This method is called before the allocator field is initialized. From runDriver, for example:

// This a bit hacky, but we need to wait until the spark.driver.port property has
// been set by the Thread executing the user class.
val sc = waitForSparkContextInitialized()

// If there is no SparkContext at this point, just fail the app.
if (sc == null) {
  ...
} else {
  ....
  registerAM(rpcEnv, sc.ui.map(_.appUIAddress).getOrElse(""), securityMgr) /* this is where allocator is initialized */
  ...
}
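
To make the ordering hazard concrete, here is a minimal standalone sketch - the object, field, and method names are made up for illustration and are not Spark's actual API:

object InitOrderSketch {
  // Mirrors the `allocator` field above: stays null until registerAM() runs.
  private var allocator: AnyRef = _

  // Throws a NullPointerException if it fires before registerAM() has assigned the field.
  def onExecutorLost(executorId: String): Unit =
    println(s"asking ${allocator.toString} to release $executorId")

  def registerAM(): Unit = {
    allocator = new Object // only here does the field get a value
  }

  def main(args: Array[String]): Unit =
    onExecutorLost("exec-1") // NPE: invoked before registerAM(), like the path described above
}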

Contributor Author (inline review comment)

Hmm, good point - I think there is some merge issue here, since it is working for us in cluster mode (I hope it is not specific to the client case?).
Let me revisit the PR - apologies if this is a merge mess-up!

Contributor (inline review comment)

You may be lucky, because the notify above wakes up the thread that initializes the allocator. But relying on luck doesn't sound like a good solution. :-)
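
For illustration only - not the actual patch, and all names below are assumptions - one way to make that ordering explicit instead of relying on notify() timing is to gate uses of the allocator behind a latch:

import java.util.concurrent.CountDownLatch

class AllocatorGateSketch {
  @volatile private var allocator: AnyRef = _
  private val allocatorReady = new CountDownLatch(1)

  def registerAM(): Unit = {
    allocator = new Object      // the real code would create the YarnAllocator here
    allocatorReady.countDown()  // publish: the field is now safe to use
  }

  def onExecutorLost(executorId: String): Unit = {
    allocatorReady.await()      // block until registerAM() has assigned the field
    println(s"releasing $executorId via $allocator")
  }
}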

Contributor Author (inline review comment)

Hopefully this fixes it? Do let me know in case I am missing something - it has been more than five releases since I last looked at the yarn module!

@tgravescs (Contributor)

Nice catch, mridul. The logic looks fine other than the note from @vanzin.

@vanzin (Contributor) commented Jul 7, 2015

What about yarn-client? ApplicationMaster already has a way to talk to the driver regardless of the deploy mode (see AMEndpoint, and the equivalent on the driver side, YarnSchedulerBackend::YarnSchedulerEndpoint).

You could just send a message to the driver and have YarnSchedulerBackend handle the removeExecutor call.
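
A rough sketch of that route - the message and trait names below are placeholders, not Spark's actual RPC classes - where the AM just sends a message to the driver endpoint and the driver-side backend performs the removal, so the same path works in client and cluster mode:

// Placeholder message type; Spark's real equivalent lives in its RPC message hierarchy.
case class RemoveExecutor(executorId: String, reason: String)

// Stand-in for a reference to the driver's RPC endpoint.
trait DriverRef { def send(msg: Any): Unit }

// AM side: on executor loss, tell the driver instead of touching a backend field directly.
class AmEndpointSketch(driver: DriverRef) {
  def executorLost(id: String, reason: String): Unit =
    driver.send(RemoveExecutor(id, reason))
}

// Driver side: the scheduler backend handles the message the same way in either deploy mode.
class DriverEndpointSketch {
  def receive(msg: Any): Unit = msg match {
    case RemoveExecutor(id, reason) => println(s"removing executor $id: $reason")
    case other                      => println(s"unhandled message: $other")
  }
}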

@mridulm (Contributor, Author) commented Jul 7, 2015

@vanzin The endpoint is for messages from core to yarn if I am not wrong?
This would flow the other way - from yarn to core.

Or am I missing something?

@vanzin (Contributor) commented Jul 7, 2015

> The endpoint is for messages from core to yarn if I am not wrong?

If you look at AMEndpoint, it has a reference to the driver endpoint, so you should be able to send messages to the driver. (For example, check the onStart method which sends a message to the driver endpoint.)
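
A tiny sketch of that pattern, with made-up names: the endpoint holds a driver reference, announces itself from onStart, and can reuse the same reference for later messages.

// Placeholder registration message sent when the endpoint starts up.
case object RegisterAM

trait DriverEndpointRef { def send(msg: Any): Unit }

class AmSideEndpoint(driverRef: DriverEndpointRef) {
  def onStart(): Unit = driverRef.send(RegisterAM)        // announce the AM to the driver
  def notifyDriver(msg: Any): Unit = driverRef.send(msg)  // later messages reuse the same reference
}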

@mridulm (Contributor, Author) commented Jul 7, 2015

Hmm, I think I will punt on fixing the client and defend against NPEs with null checks - and leave it to someone more familiar with yarn-client mode to patch the support in.

I am fixing the merge issue (our changes are on 1.3 and the port to master is causing the issues :-) ) and will submit momentarily.
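
A sketch of the kind of defensive guard being described, with assumed names - skip the allocator call when the field has not been initialized yet rather than risking an NPE:

class NullGuardSketch {
  @volatile private var allocator: AnyRef = _ // may still be null, e.g. in yarn-client mode here

  def registerAM(): Unit = { allocator = new Object }

  def onExecutorLost(executorId: String): Unit = {
    val a = allocator
    if (a == null) {
      // Not initialized yet: log and skip instead of throwing.
      println(s"allocator not ready; ignoring loss of $executorId")
    } else {
      println(s"asking $a to release $executorId")
    }
  }
}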

@vanzin (Contributor) commented Jul 7, 2015

> Hmm, I think I will punt on fixing the client and defend against NPEs with null checks

The thing is, if you use the endpoint route, the same code would work for both client and cluster. You don't need to special case anything. You'd just send a message to the driver saying "removeExecutor" and the driver would handle it, regardless of where it is. No plumbing of the "backend" variable anywhere.

Your current code is explicitly coupled to cluster mode and cannot, ever, work in client mode. So fixing this for client mode would mean reverting your changes. So why not do the final change right now?

@mridulm (Contributor, Author) commented Jul 7, 2015

The bug was filed a month back - it is a fairly critical issue but was seeing no progress or resolution.
This PR is a forward port of my fix from 1.3 to master from our internal codebase.
1.3 does not have the actor, etc. changes - and since we do not use client mode on my team, the fix does not address those.

I am perfectly fine if anyone wants to use this as a template and fix it in a more principled manner aligned with master, in which case I can close the PR.

@vanzin (Contributor) commented Jul 7, 2015

It's ok if you don't want to work on the full fix, I'm just wary of pushing a change to master that only fixes part of the problem, when the change to fix everything would be even smaller (and, in my view, cleaner). But maybe others feel differently.

@mridulm (Contributor, Author) commented Jul 7, 2015

@vanzin I do not disagree with you - it would indeed be cleaner, and I do agree with your point of view.
I had not looked at client mode when I fixed this, and my assumption that it would work for both was not correct - as you pointed out.

So currently this change works only for yarn cluster mode - while still of value, it would require some rework when client mode needs to be fixed.

Unfortunately I am on 1.3 in cluster mode, and I cannot move to master or test against it.

Which is why I am perfectly fine with closing this PR if anyone wants to fix it in a more principled manner. I just don't want the issue to go unresolved before the next release (we missed it for the current one already).

@vanzin (Contributor) commented Jul 7, 2015

If you give me a few days I can take over your patch and update it (just don't delete your branch). But at this moment I'm swamped with other things here... :-/

@SparkQA commented Jul 8, 2015

Test build #36726 has finished for PR 7243 at commit 687790f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm (Contributor, Author) commented Jul 8, 2015

@vanzin Oh great, that is a very generous offer!
I will keep it around as long as you need :-) I just want to get it into the next release.

@tgravescs (Contributor)

@mridulm can you close this then in favor of #7431?

@mridulm closed this Jul 17, 2015