Conversation

@jerryshao
Contributor

What changes were proposed in this pull request?

Currently in Spark on YARN, configuration can be passed through SparkConf, environment variables, and command-line arguments, and some of these overlap (for example, client arguments duplicate SparkConf settings). This PR proposes to simplify the command-line arguments.

How was this patch tested?

This patch was tested manually and with unit tests.

CC @vanzin @tgravescs, please help review this proposal. The original purpose of this JIRA was to remove ClientArguments entirely, but during refactoring it turned out that some arguments like --class and --arg are not easy to replace, so here I remove most of the command-line arguments and keep only a minimal set.

jerryshao changed the title from "[SPARK-12343] Simplify Yarn client and client argument" to "[SPARK-12343][YARN] Simplify Yarn client and client argument" on Mar 9, 2016
Contributor Author

Do we still need to support environment variables?

Contributor

No; as I pointed out, the other JIRA, SPARK-3374, proposes removing them. There might be some internal env variables used just to communicate between the Client and the AM, but anything external can be removed for 2.x.

Contributor Author

I think this could be done in SPARK-3374 as a separate patch to clean up all the internal env variables.

Contributor

That would be OK, but you have already removed most of them; it looks like there are only a couple left in this file, so it might be just as easy to do it here.


SparkQA commented Mar 9, 2016

Test build #52743 has finished for PR 11603 at commit 0c0a95c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

Hmm, it looks like this overlaps a lot with https://issues.apache.org/jira/browse/SPARK-3374, which is to deprecate the env variables, remove deprecated configs, and remove the client args. Someone is actively working on that too, so it stinks a bit to duplicate the work.

From a quick skim it seems like you did everything proposed there, so we might as well mark them as duplicates. Or actually, why don't you just update this PR description to reference both JIRAs?

Contributor

Removing this and letting it use spark.jars switches to Spark's addJar mechanism for distributing jars instead of the YARN distributed cache, which is what it used before. I'm not sure I want to change that; I would rather stay consistent and use the YARN distributed cache as much as possible unless we have a reason to change.

If this means we need to add another config like spark.yarn.dist.jars, I'm fine with that. We can leave it undocumented for now if we want, since spark.jars/spark.files isn't documented either, though I'm not quite sure why.
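For illustration, such a spark.yarn.dist.jars entry might look like the sketch below, modeled on the ConfigBuilder snippets quoted later in this thread; the wrapper object and the exact builder calls are assumptions, not the PR's actual code.

```scala
package org.apache.spark.deploy.yarn

import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical container; in the PR the entry would live in the YARN
// module's config definitions.
object YarnConfigSketch {
  // Comma-separated list of extra jars to ship via the YARN distributed cache.
  private[spark] val JARS_TO_DISTRIBUTE = ConfigBuilder("spark.yarn.dist.jars")
    .stringConf
    .optional
}
```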

Contributor

Yeah, we shouldn't lose the distributed cache functionality.

On that topic, but probably orthogonal to this change, at some point I'd like to see yarn-client somehow also use the distributed cache for these jars. IIRC that doesn't happen at the moment.

Contributor

(Reading the rest of the change, if you figure out an easy way to have YARN's Client class handle spark.jars, you could solve both problems here... maybe some check in SparkContext to not distribute them in YARN mode?)
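A minimal sketch of that idea, with invented names: SparkContext would consult the master before serving jars itself, leaving YARN mode to the Client and the distributed cache.

```scala
// Hypothetical helper; Spark's real check (if any) may differ.
object JarDistributionSketch {
  /** Whether SparkContext itself should distribute user jars. */
  def distributeViaSparkContext(master: String): Boolean =
    !master.startsWith("yarn")

  // Example: distributeViaSparkContext("yarn-cluster") == false
}
```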

Contributor Author

I see, thanks a lot for your suggestions.

@tgravescs
Contributor

Personally I'm fine with leaving some parameters in ClientArguments as long as they don't overlap with configs. The main thing is not having to process the same config/setting multiple times. Here we either need the parameters or have to add new configs. Since these are all private, we can change it later if needed.

@tgravescs
Contributor

Note: I think the biggest thing here is to test this thoroughly, making sure all the various options work in both modes (cluster/client) across spark-submit, pyspark, spark-shell, and sparkr.

Contributor

Do we need to pass these as explicit command-line arguments to the AM? Since we're simplifying code, we might as well take the chance and make the AM read these from the configuration too.
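As a sketch of what "read these from the configuration" could mean, assuming the SparkConf already reaches the AM (the key and default below are illustrative; getSizeAsMb is an existing SparkConf helper):

```scala
import org.apache.spark.SparkConf

object AmConfigSketch {
  // Read executor memory from configuration instead of an AM command-line flag.
  def executorMemoryMb(conf: SparkConf): Long =
    conf.getSizeAsMb("spark.executor.memory", "1g")
}
```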

Contributor Author

@vanzin, so you mean we should also simplify the command-line arguments for the AM?


vanzin commented Mar 9, 2016

Part of SPARK-12343 was to make object Client private, which is missing from this change.

Contributor

This is a core config, so it should be declared in core/src/main/scala/org/apache/spark/internal/config/package.scala instead. The same goes for EXECUTOR_MEMORY, PY_FILES, and probably also EXECUTOR_CORES, although I'm not completely sure about that last one.
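A hedged sketch of what those declarations might look like once moved into core's config package (the builder calls follow the style quoted later in this thread and are assumptions):

```scala
package org.apache.spark.internal.config

// Hypothetical container standing in for package.scala's package object.
object CoreConfigSketch {
  private[spark] val DRIVER_MEMORY = ConfigBuilder("spark.driver.memory")
    .stringConf
    .optional

  private[spark] val EXECUTOR_MEMORY = ConfigBuilder("spark.executor.memory")
    .stringConf
    .optional
}
```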

Contributor Author

I see, will change it.

Contributor Author

Here I put all the additional jars into a new configuration, spark.yarn.dist.jars, which is picked up by the YARN Client and placed in the distributed cache. So now additional jars go into the distributed cache in both yarn-client and yarn-cluster mode.

Another question: do we need to put the user jar into the distributed cache in yarn-client mode as well? I think it is doable; is there any special concern?
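To illustrate the flow being described (every name below except LocalResourceType is hypothetical): the Client reads the comma-separated spark.yarn.dist.jars value and registers each jar with the distributed cache.

```scala
import org.apache.hadoop.yarn.api.records.LocalResourceType

object DistJarsSketch {
  // Stand-in for the Client's real upload/registration helper.
  def distribute(path: String, resType: LocalResourceType): Unit =
    println(s"caching $path as $resType")

  def main(args: Array[String]): Unit = {
    // spark.yarn.dist.jars is comma-separated; each entry goes to the cache.
    val jarsToDistribute = Option("a.jar,b.jar")
    jarsToDistribute.foreach { jars =>
      jars.split(",").filter(_.nonEmpty)
        .foreach(distribute(_, LocalResourceType.FILE))
    }
  }
}
```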

Contributor

I think we should just leave that as is for now. We can file a separate JIRA if we want to change it.

Contributor

So dist.files and dist.archives are public and documented; it seems like we should make dist.jars public and document it in the YARN docs as well, unless someone has a reason not to.

Contributor Author

Sure, I will add it to the yarn doc.

Contributor Author

Looks like there's another config, "spark.jars", that handles this; maybe we don't need to add another one, and dist.jars could be kept internal for YARN use only.

Contributor

spark.jars is for distributing via Spark's internal mechanisms, while this one is done via the distributed cache. We should add it to the YARN-only section of the docs, similar to dist.files and dist.archives.
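The YARN page documents such properties in an HTML table; a row for the new config might read roughly as follows (illustrative wording, not the merged docs text):

```html
<tr>
  <td><code>spark.yarn.dist.jars</code></td>
  <td>(none)</td>
  <td>
    Comma-separated list of jars to be distributed to YARN containers via
    the distributed cache.
  </td>
</tr>
```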


SparkQA commented Mar 10, 2016

Test build #52825 has finished for PR 11603 at commit 8532ee7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 17, 2016

Test build #53412 has finished for PR 11603 at commit 40c1a19.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 21, 2016

Test build #53664 has finished for PR 11603 at commit 7f41403.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 22, 2016

Test build #53751 has finished for PR 11603 at commit 6d3c62d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

We should remove the SPARK_LOG4J_CONF functionality in Client.scala.

Have you looked at the test failures?

@jerryshao
Contributor Author

OK, I will remove this old one. The unit test failure looks unrelated.

@jerryshao
Contributor Author

Another environment variable is SPARK_JAVA_OPTS; it has been deprecated since 1.0, but lots of places still use it. Do we need to remove this one too?


SparkQA commented Mar 24, 2016

Test build #54025 has finished for PR 11603 at commit f9b62a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

We should file a separate JIRA for SPARK_JAVA_OPTS since it's used in core and other places.

```diff
   private[spark] val DRIVER_USER_CLASS_PATH_FIRST =
     ConfigBuilder("spark.driver.userClassPathFirst").booleanConf.withDefault(false)

+  private[spark] val DRIVER_MEMORY = ConfigBuilder("spark.driver.memory")
```
Contributor

nit: indentation is wrong
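Presumably this refers to the new entry's formatting; a guess at the conventional layout (two-space continuation indentation, matching the neighboring entry; the builder calls are assumptions):

```scala
private[spark] val DRIVER_MEMORY = ConfigBuilder("spark.driver.memory")
  .stringConf
  .optional
```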


vanzin commented Mar 25, 2016

Mostly good, just a few comments.


SparkQA commented Mar 29, 2016

Test build #54420 has finished for PR 11603 at commit d152f9f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 30, 2016

Test build #54499 has finished for PR 11603 at commit 7feae6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
-      (args.addJars, LocalResourceType.FILE, true),
-      (args.files, LocalResourceType.FILE, false),
-      (args.archives, LocalResourceType.ARCHIVE, false)
+      (sparkConf.get(JARS_TO_DISTRIBUTE).orNull, LocalResourceType.FILE, true),
```
Contributor

minor: I think you could simplify this a little bit by listing just the options (instead of sparkConf.get(option).orNull); if you made the config entries lists defaulting to Nil (adding toSequence to their builders, and using withDefault instead of optional) you could save even more code below.
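A sketch of the suggested simplification, using exactly the builder calls named above (toSequence, withDefault with Nil); the API shape and wrapper object are assumed from the comment, not taken from the PR:

```scala
package org.apache.spark.deploy.yarn

import org.apache.spark.internal.config.ConfigBuilder

object SimplifiedConfigSketch {
  // A Seq[String] entry defaulting to Nil lets call sites drop the
  // sparkConf.get(option).orNull pattern shown in the hunk above.
  private[spark] val JARS_TO_DISTRIBUTE = ConfigBuilder("spark.yarn.dist.jars")
    .stringConf
    .toSequence
    .withDefault(Nil)
}
```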


vanzin commented Mar 30, 2016

LGTM, I just left a really minor suggestion to save a few lines of code but no need to address that right now.


SparkQA commented Mar 31, 2016

Test build #54596 has finished for PR 11603 at commit 3bb44b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


vanzin commented Apr 1, 2016

Merging to master, thanks!

asfgit closed this in 8ba2b7f on Apr 1, 2016