
Conversation

@liancheng
Contributor

JIRA issue: SPARK-2678

Currently, spark-submit shadows user application options. Namely, options like --help are always captured by spark-submit and won't be passed to the user application. One negative impact is that every time we add a new option to spark-submit, we may break existing user scripts if the new option name happens to shadow an option already used by the application.

This PR introduces an incompatible change to fix this issue: -- is used as a separator between spark-submit options and user application options. All arguments after -- are passed to the application as-is. Thus,

./bin/spark-submit --class Foo user.jar arg1 -arg2 --arg3 x
# or
./bin/spark-submit user.jar --class Foo arg1 -arg2 --arg3 x

must be rewritten to

./bin/spark-submit --class Foo user.jar -- arg1 -arg2 --arg3 x
# or
./bin/spark-submit user.jar --class Foo -- arg1 -arg2 --arg3 x

@pwendell Please help review, thanks. Especially, maybe we need a vote here to decide whether fixing this issue is worth an incompatible change.

@vanzin
Contributor

vanzin commented Jul 31, 2014

I think this is a welcome change, but I don't like the compatibility-breaking aspect of it. It shouldn't be hard to maintain the current semantics (stop parsing SparkSubmit options when an unknown option is found) while still adding support for "--".

It will be sub-optimal for those who run into the situation you described, but on the other hand it won't break every single existing application out there using spark-submit.

@pwendell
Contributor

Hey @liancheng, why not do an alternative fix where we avoid shadowing user options but don't change the documented format? The main issue, as I see it, is that we aren't currently implementing the specification we give in the docs. I don't think that implies we need to change the format.

@liancheng
Contributor Author

@vanzin @pwendell Are you suggesting something like this:

if no "--" exists in CLI arguments
  treat all those options after primary resource as application options
else:
  split them at "--"
  take those before "--" as spark-submit options
  take those after "--" as application options

This logic happily accepts the following invocation:

bin/spark-submit --class Foo user.jar --arg1

But we still need to rewrite the following one (a pattern mostly used in scripts that delegate to spark-submit)

bin/spark-submit --class Foo user.jar --master local --arg1

into

bin/spark-submit --class Foo user.jar --master local -- --arg1
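To make the proposal concrete, here is a minimal, self-contained sketch of that splitting logic. The object and method names are made up for illustration; this is not the actual SparkSubmitArguments code:

// Hypothetical sketch of the "split at --" idea described above.
object SeparatorSketch {
  // Returns (spark-submit arguments, application arguments). If no "--" is
  // present, the second list is empty and the caller falls back to the
  // current behavior.
  def splitAtSeparator(args: List[String]): (List[String], List[String]) = {
    val (before, rest) = args.span(_ != "--")
    (before, rest.drop(1)) // drop the "--" itself, if any
  }

  def main(args: Array[String]): Unit = {
    val cli = List("--class", "Foo", "user.jar", "--master", "local", "--", "--arg1")
    val (sparkArgs, appArgs) = splitAtSeparator(cli)
    println(sparkArgs) // List(--class, Foo, user.jar, --master, local)
    println(appArgs)   // List(--arg1)
  }
}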

Take sbin/start-thriftserver.sh as an example; the core lines of this script are:

CLASS=org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
exec "FWDIR"/bin/spark-submit --class $CLASS spark-internal "$@"

Note that $@ may contain both spark-submit options and application options by design. Consider the following two invocations:

# Case A
./sbin/start-thriftserver.sh --hiveconf hive.root.logger=DEBUG,console
# Case B
./sbin/start-thriftserver.sh --master local -- --hiveconf hive.root.logger=DEBUG,console

Case A is compatible with the current documented design, while case B still requires the user to add -- after --master local. Although a little better, this is still an incompatible change (you can't just pass --master local --hiveconf ... as you can now).

(And I haven't updated the documentation yet, will do that after we come to a final conclusion).

@vanzin
Contributor

vanzin commented Jul 31, 2014

@liancheng no, I'm suggesting keeping the current behavior, which is defined in SparkSubmitArguments.scala around line 310, starting with this:

case value :: tail =>
  if (inSparkOpts) {

What that code does is, as soon as an unrecognized option is found, it considers everything else as app options. That means that it doesn't matter where in the command line the "primary resource" is.

Examples:

spark-submit /my.jar --master foo --unknown-option 10

--unknown-option 10 becomes an argument to the main class.

spark-submit --master foo /my.jar --unknown-option 10

Same thing. The issue you point out, and that's true, is that if --unknown-option is added as a SparkSubmit option, it won't be passed to the app anymore. In that case, and that case alone, the command line must be re-written as either of:

spark-submit /my.jar --master foo -- --unknown-option 10
spark-submit --master foo /my.jar -- --unknown-option 10

Note that nobody has to change anything except the user who wants their application to keep receiving --unknown-option as an argument.

Basically, don't change any of the existing code; just add a case that handles "--" to stop parsing SparkSubmit options. That should be very, very simple to do. Could be as simple as adding:

case "--" :: tail =>
  childArgs = tail

(Although that's obviously completely untested.)
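For illustration only, here is a rough, self-contained sketch that mimics the shape of that recursive parse match with the extra "--" case added. Option handling and field types are heavily simplified; this is not the real SparkSubmitArguments code:

import scala.collection.mutable.ArrayBuffer

object ParseSketch {
  val childArgs = new ArrayBuffer[String]()
  var master: Option[String] = None
  var primaryResource: Option[String] = None
  var inSparkOpts = true

  def parse(opts: List[String]): Unit = opts match {
    case "--master" :: value :: tail =>
      // one recognized spark-submit option, standing in for all of them
      master = Some(value)
      parse(tail)

    case "--" :: tail =>
      // proposed addition: explicit separator, everything after it goes to the app
      childArgs ++= tail

    case value :: tail =>
      if (inSparkOpts) {
        // first unrecognized value is taken as the primary resource
        primaryResource = Some(value)
        inSparkOpts = false
      } else {
        // after the primary resource, unrecognized values are app arguments
        childArgs += value
      }
      parse(tail)

    case Nil =>
  }

  def main(args: Array[String]): Unit = {
    parse(List("/my.jar", "--master", "foo", "--", "--unknown-option", "10"))
    println(s"master=$master resource=$primaryResource childArgs=$childArgs")
    // master=Some(foo) resource=Some(/my.jar) childArgs=ArrayBuffer(--unknown-option, 10)
  }
}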

@sryza
Contributor

sryza commented Aug 1, 2014

I'm worried that treating unknown args as app args would make typos difficult to debug.

spark-submit --executor-croes 10
should print out an error, not silently ignore the number of executors I requested

@pwendell
Contributor

pwendell commented Aug 1, 2014

Yeah, I think we can keep full backwards compatibility with the current approach and then just also support adding -- as a delimiter between Spark and user arguments.

@pwendell
Contributor

pwendell commented Aug 1, 2014

Hey @liancheng, I just spoke with @mateiz offline about this for a while. He had feedback on a few things, which I'll summarize here. There are a couple of orthogonal issues going on. Here were his suggestions:

  1. Scripts like start-thriftserver.sh should expose one coherent set of options rather than distinguishing between spark-submit options and other options. The idea was to make this more convenient and less confusing for users. This means the scripts would have to match on the more specific options and make sure those are delivered separately to spark-submit. It makes the internals more complicated, but for users it's simpler. Also, because we control the total set of options for internal tools, we can make sure there are no collisions in the naming.
  2. If we implement -- as a divider, it should be fully backwards compatible. For instance, we need to support users who were doing this in Spark 1.0:
./spark-submit --master local myJar.jar --userOpt a --userOpt b -- --userOpt c

i.e. user programs that used -- in their own options. The way to fully support this is to make the use of -- mutually exclusive with specifying a primary resource. So this means a user can either do:

./spark-submit --master local --jars myJar.jar -- --userOpt a --userOpt b

or they can do

./spark-submit --master local myJar.jar --userOpt a --userOpt b

So basically, when the parser arrives at an unrecognized option (which we assume to be a resource), we always treat the rest of the list as user options, even if those user options happen to include a --.
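As a hypothetical illustration of that rule (assuming, for simplicity, that every spark-submit option takes exactly one value; this is not the actual implementation):

// "--" only separates spark-submit options from user options while no primary
// resource has been seen; once a resource appears, everything after it, a
// literal "--" included, belongs to the application.
object MutuallyExclusiveSketch {
  final case class Parsed(
      sparkOpts: List[String],
      primaryResource: Option[String],
      userArgs: List[String])

  def parse(args: List[String], isSparkOpt: String => Boolean): Parsed = {
    def loop(rest: List[String], acc: List[String]): Parsed = rest match {
      case "--" :: userArgs =>
        // separator seen before any primary resource: the rest is the app's
        Parsed(acc.reverse, None, userArgs)
      case opt :: value :: tail if isSparkOpt(opt) =>
        loop(tail, value :: opt :: acc)
      case resource :: userArgs =>
        // primary resource reached: the rest, "--" included, is the app's
        Parsed(acc.reverse, Some(resource), userArgs)
      case Nil =>
        Parsed(acc.reverse, None, Nil)
    }
    loop(args, Nil)
  }

  def main(args: Array[String]): Unit = {
    val isSparkOpt = Set("--master", "--jars")
    // Spark 1.0 style: "--" after the jar stays a user argument
    println(parse(List("--master", "local", "myJar.jar",
      "--userOpt", "a", "--userOpt", "b", "--", "--userOpt", "c"), isSparkOpt))
    // New style: no primary resource on the command line, "--" is the divider
    println(parse(List("--master", "local", "--jars", "myJar.jar",
      "--", "--userOpt", "a", "--userOpt", "b"), isSparkOpt))
  }
}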

@liancheng
Contributor Author

Thanks for the great feedback, I agree, will update soon.

@liancheng
Contributor Author

Also, thank you @vanzin! You're right, it can be done in a backwards-compatible yet simple way.

@liancheng
Contributor Author

Closing this one; #1715 fixes this issue in a backwards-compatible way. @pwendell @vanzin @sryza Thanks for the review!

@liancheng liancheng closed this Aug 1, 2014
@liancheng liancheng deleted the spark-submit-args branch September 24, 2014 00:09
