[SPARK-12343][YARN] Simplify Yarn client and client argument #11603
Conversation
Do we still need to support environment variables?
No; as I pointed out, the other JIRA, SPARK-3374, was proposing to remove them. There might be some internal env variables used just to communicate between the Client and the AM, but anything external can be removed for 2.x.
I think this could be done in SPARK-3374 as a separate patch to clean up all the internal env variables.
That would be OK, but I think you have already removed most of them; it looks like there are only a couple left in this file, so it might be just as easy to do it here.
Test build #52743 has finished for PR 11603 at commit
Hmm, it looks like this overlaps a lot with https://issues.apache.org/jira/browse/SPARK-3374, which is to deprecate the env variables, remove deprecated configs, and remove the client args. Someone is actively working on that too, so it stinks a bit to duplicate the work. Taking a quick skim, it seems like you did everything proposed there, so we might as well mark these as duplicates. Or actually, why don't you just update this PR to reference both JIRAs in the description?
Removing this and letting it use spark.jars switches to Spark's addJar mechanism for distributing jars instead of the YARN distributed cache, as it did before. I'm not sure I want to change that; I would rather keep as much as possible consistent with the YARN distributed cache unless we have a reason to change.
If this means we need to add another config like spark.yarn.dist.jars, I'm fine with that. We can leave it undocumented for now if we want, since spark.jars/spark.files isn't documented either, though I'm not quite sure why.
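For illustration, a minimal sketch of what such an entry could look like, following the ConfigBuilder style quoted elsewhere in this thread (the entry name and exact builder calls are assumptions, not the merged code):

```scala
// Hypothetical config entry for jars to ship via the YARN distributed cache.
// `.optional` yields an Option[String], matching the .orNull usage seen later.
private[spark] val JARS_TO_DISTRIBUTE = ConfigBuilder("spark.yarn.dist.jars")
  .stringConf
  .optional
```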
Yeah, we shouldn't lose the distributed cache functionality.
On that topic, but probably orthogonal to this change, at some point I'd like to see yarn-client somehow also use the distributed cache for these jars. IIRC that doesn't happen at the moment.
(Reading the rest of the change, if you figure out an easy way to have YARN's Client class handle spark.jars, you could solve both problems here... maybe some check in SparkContext to not distribute them in YARN mode?)
I see, thanks a lot for your suggestions.
Personally I'm fine with leaving some parameters in ClientArguments as long as they don't overlap with configs. The main thing is to not have to process the same config/setting multiple times. Here we either need the parameters or we add new configs. Since these are all private, we can change it later if needed.

Note: I think the biggest thing here is to thoroughly test this to make sure all the various options work in both modes (cluster/client) with spark-submit, pyspark, spark-shell, and sparkr.
Do we need to pass these as explicit command line arguments to the AM? Since we're simplifying code, we might as well take the chance and make the AM read these from the configuration too.
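As a rough sketch of that idea (the method name here is illustrative, not from the actual patch), the AM could read the values from the SparkConf it already receives instead of parsing dedicated flags:

```scala
import org.apache.spark.SparkConf

// Hedged sketch: read resource settings from SparkConf on the AM side
// instead of from --executor-memory / --executor-cores style arguments.
private def resourcesFromConf(conf: SparkConf): (Int, Int) = {
  val executorMemoryMb = conf.getSizeAsMb("spark.executor.memory", "1g").toInt
  val executorCores = conf.getInt("spark.executor.cores", 1)
  (executorMemoryMb, executorCores)
}
```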
@vanzin, so you mean we should also simplify the command line arguments for the AM?
Part of SPARK-12343 was to make …
This is a core config, so it should be declared in core/src/main/scala/org/apache/spark/internal/config/package.scala instead. Same for EXECUTOR_MEMORY, PY_FILES, and probably also EXECUTOR_CORES, although I'm not completely sure about that last one.
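For reference, declarations along those lines in the core config package might look like this (a sketch; the exact builder chains in the merged code may differ):

```scala
// core/src/main/scala/org/apache/spark/internal/config/package.scala (sketch)
private[spark] val DRIVER_MEMORY = ConfigBuilder("spark.driver.memory")
  .stringConf
  .withDefault("1g")

private[spark] val EXECUTOR_MEMORY = ConfigBuilder("spark.executor.memory")
  .stringConf
  .withDefault("1g")
```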
I see, will change it.
Here I put all the additional jars into a configuration, spark.yarn.dist.jars; this will be picked up by the YARN Client and put into the distributed cache. So now additional jars are put into the distributed cache in both yarn-client and yarn-cluster mode.
Another question: do we also need to put the user jar into the distributed cache in yarn-client mode? I think it is doable, but I'm not sure whether there is any special concern.
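Sketching that flow (here `distribute` is a hypothetical helper standing in for the Client's upload logic, not an actual method name):

```scala
// Hedged sketch: the YARN Client reads spark.yarn.dist.jars and registers
// each jar in the distributed cache, in both client and cluster mode.
// `distribute` is a stand-in for the Client's actual upload helper.
sparkConf.get(JARS_TO_DISTRIBUTE).foreach { jarList =>
  jarList.split(",").filter(_.nonEmpty).foreach { jar =>
    distribute(jar, resType = LocalResourceType.FILE, addToClasspath = true)
  }
}
```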
I think we should just leave that as-is for now. We can file a separate JIRA if we want to change it.
So dist.files and dist.archives are public and documented; it seems like we should make dist.jars public and document it in the YARN docs as well, unless someone has a reason not to.
Sure, I will add it to the yarn doc.
Looks like there's another config, "spark.jars", that handles this; maybe we don't need to add another one, and we could make dist.jars internal, for YARN use only.
spark.jars is for distributing via Spark's internal mechanisms, while this one goes through the distributed cache; we should add it to the YARN-only section of the docs, similar to dist.files and dist.archives.
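To make the distinction concrete, a usage sketch (the paths are made up for illustration):

```scala
import org.apache.spark.SparkConf

// spark.jars is served through Spark's own distribution mechanism, while
// spark.yarn.dist.jars goes through the YARN distributed cache.
val conf = new SparkConf()
  .set("spark.jars", "/opt/libs/app-extra.jar")           // Spark-internal distribution
  .set("spark.yarn.dist.jars", "/opt/libs/yarn-only.jar") // YARN distributed cache
```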
Test build #52825 has finished for PR 11603 at commit
Test build #53412 has finished for PR 11603 at commit
Test build #53664 has finished for PR 11603 at commit
Test build #53751 has finished for PR 11603 at commit
We should remove the SPARK_LOG4J_CONF functionality in Client.scala. Have you looked at the test failures?
OK, I will remove this old one. The unit test failure looks unrelated.
Another environment variable is SPARK_JAVA_OPTS …
Test build #54025 has finished for PR 11603 at commit
We should file a separate JIRA for SPARK_JAVA_OPTS since it's used in core and other places.
```scala
private[spark] val DRIVER_USER_CLASS_PATH_FIRST =
ConfigBuilder("spark.driver.userClassPathFirst").booleanConf.withDefault(false)

private[spark] val DRIVER_MEMORY = ConfigBuilder("spark.driver.memory")
```
nit: indentation is wrong
Mostly good, just a few comments.
Test build #54420 has finished for PR 11603 at commit
Test build #54499 has finished for PR 11603 at commit
```scala
(args.addJars, LocalResourceType.FILE, true),
(args.files, LocalResourceType.FILE, false),
(args.archives, LocalResourceType.ARCHIVE, false)
(sparkConf.get(JARS_TO_DISTRIBUTE).orNull, LocalResourceType.FILE, true),
```
minor: I think you could simplify this a little bit by listing just the options (instead of sparkConf.get(option).orNull); if you made the config entries lists defaulting to Nil (adding toSequence to their builders, and using withDefault instead of optional) you could save even more code below.
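Concretely, the suggestion amounts to a builder chain like this (a sketch of the suggested shape, not the merged code):

```scala
// Sequence-valued entry defaulting to Nil, so call sites can drop the
// sparkConf.get(option).orNull handling and iterate directly.
private[spark] val JARS_TO_DISTRIBUTE = ConfigBuilder("spark.yarn.dist.jars")
  .stringConf
  .toSequence
  .withDefault(Nil)
```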
LGTM, I just left a really minor suggestion to save a few lines of code, but no need to address that right now.
Test build #54596 has finished for PR 11603 at commit
Merging to master, thanks!
What changes were proposed in this pull request?
Currently in Spark on YARN, configurations can be passed through SparkConf, environment variables, and command-line arguments; some of these duplicate each other, like the client arguments and SparkConf. So here I propose to simplify the command-line arguments.
How was this patch tested?
This patch was tested manually and with unit tests.
CC @vanzin @tgravescs, please help review this proposal. The original purpose of this JIRA is to remove ClientArguments; through refactoring, some arguments like --class and --arg turned out to be not so easy to replace, so here I remove most of the command line arguments and keep only a minimal set.