[SPARK-16586] Change the way the exit code of launcher is handled to avoid problem when launcher fails #14231
Conversation
Change the way the exit code of the launcher is handled so that there won't be a problem when the launcher fails.
Can one of the admins verify this patch?
```bash
# won't fail if some path of the user contains special characters such as '\n' or space
#
# Also note that when the launcher fails, it might not output something ending with '\0' [SPARK-16586]
_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
```
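For context, a minimal sketch of how the hex-encoded output above could be decoded back into an argument array; the decoding side is reconstructed here for illustration and is not part of the diff (it assumes `xxd` is available):

```bash
# Because the subshell ends with 'exit ${PIPESTATUS[0]}', $? right after the
# assignment is the launcher's exit code, not xxd's or tr's.
LAUNCHER_EXIT_CODE=$?
# xxd -r -p reverses the hex encoding, restoring the original
# '\0'-separated arguments, including any '\n' or spaces inside them.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(printf '%s' "$_CMD" | xxd -r -p)
```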
Is the problem that the JVM may not be able to allocate a 128m heap? I don't think that's supported at all; nothing will work. I don't think we want to complicate this much just to make a better error message, but if there's a particular exit status that indicates this that you can use to report a better message, OK. I wouldn't make the rest of this change.
Yes, you are right. I found this problem when I tried to run Spark for testing on the login node of an HPC cluster, which limits the amount of memory applications can use. This is not a big problem, so it's OK to mark this as "won't fix", but it might be confusing to users who don't understand shell scripts. Actually this patch is a small change and it reduces the total lines of code (though I agree that this line is a little tricky and harder to understand). That's why I added some comments to explain what's happening.
By the way, the previous code assumed that the output of the launcher ends with a '\0'; otherwise it will crash due to the same problem. I read the code in org.apache.spark.launcher. It seems that the launcher's main rarely outputs something and then exits non-zero, so the only way this becomes a problem is if there is an uncaught exception or something wrong with the JVM itself.
Is it easier to just check the exit status of this process with $? and exit with a better message if not 0?
The biggest problem in achieving this: on success, the exit status of the process is 0 and its output is a sequence of arguments separated by '\0'. Bash doesn't allow storing '\0' in a variable, so we need an alternative way to store the output, which means `$?` is no longer the exit status of the launcher.
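A quick illustration of the limitation (runnable in any bash shell):

```bash
# Command substitution silently drops NUL bytes, so '\0'-separated output
# cannot be stored in a plain variable:
OUT=$(printf 'a\0b')
echo "${#OUT}"   # prints 2 -- the '\0' between 'a' and 'b' was discarded
```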
Things can also be implemented like this:
```bash
# line 70-73
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  LAUNCHER_EXIT_CODE=$?
  if [ "$LAUNCHER_EXIT_CODE" != 0 ]; then printf "\0"; fi
  printf "%d\0" $LAUNCHER_EXIT_CODE
}
```
and
```bash
# line 83-87
CMD=("${CMD[@]:0:$LAST}")
if [ "$LAUNCHER_EXIT_CODE" != 0 ]; then
  echo "${CMD[@]}"
  exit $LAUNCHER_EXIT_CODE
fi
```
This would minimize the change to the original code (a rough reconstruction of the surrounding code is sketched below).
If you'd prefer me to change the PR to do it this way, let me know and I'd be happy to do that, or you can fix this small bug manually.
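For reference, a sketch of the surrounding spark-class code that the two snippets above plug into, reconstructed from the script's structure (not a verbatim quote):

```bash
# Read build_command's '\0'-separated output into an array; the last
# element is the exit code appended by 'printf "%d\0"' above.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}   # extracted before CMD is truncated at line 83-87
```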
Eh, I'm talking about something much simpler.
```bash
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  if [ $? != 0 ]; then
    echo "Couldn't run launcher"
    exit $?
  fi
  printf "%d\0" $?
}
```
Does that make sense?
This won't work: build_command is executed in a subshell, so the `exit $?` will only terminate the subshell.
You can try this in a bash shell:
```bash
build_command(){ printf "aaaa\0"; exit 1; }
CMD=()
while IFS= read -d '' -r ARG; do CMD+=("$ARG"); done < <(build_command)
```
You will see that the shell doesn't exit, and CMD is set to aaaa.
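By contrast, a quick way to see that `exit` does propagate when the function is not run in a subshell (run it inside a throwaway `bash -c`, not your interactive shell):

```bash
# 'exit 1' inside the function terminates the whole bash -c shell,
# so 'echo unreachable' never runs.
bash -c 'build_command(){ printf "aaaa\0"; exit 1; }; build_command; echo unreachable'
echo $?   # prints 1
```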
Aha. OK, next question: does that really need to be in a subshell? You mentioned it has something to do with not being able to store the null char in bash -- does that make it impossible to just store the output of this command directly in a variable?
Really I'd punt to @vanzin, who will be a more informed reviewer.
I cannot find a neat solution to deal with the \0...
Another option is to replace the \0 with \n and store the result in a variable, but that would be a problem if the user's Spark path contains a \n...
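A small sketch of why that alternative is fragile (the arguments here are hypothetical, for illustration only):

```bash
# Translating '\0' to '\n' makes the output storable in a variable, but a
# path that itself contains a newline is then split into two arguments:
OUT=$(printf '/path/with\nnewline\0--verbose\0' | tr '\0' '\n')
CMD=()
while IFS= read -r ARG; do CMD+=("$ARG"); done <<< "$OUT"
printf '<%s>\n' "${CMD[@]}"   # three arguments come out instead of two
```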
I think the change is a little confusing and doesn't really explain what's going on. I ran some tests ("ulimit -v" is your friend), and this patch covers what's wrong here. Here's the error output:
Yes, this patch looks clearer.
Some very rare JVM errors are printed to stdout, and that confuses the code in spark-class. So add a check so that those cases are detected and the proper error message is shown to the user. Tested by running spark-submit after setting "ulimit -v 32000". Closes apache#14231
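The committed fix itself isn't quoted in this thread; per the message above it detects JVM errors printed to stdout. One way such a check could look (a sketch, assuming the exit code is still the last '\0'-separated token, so anything non-numeric there signals a polluted output):

```bash
# If the last token isn't a plain integer, the launcher's output was
# polluted by a JVM error message; surface it instead of misparsing it.
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" 1>&2
  exit 1
fi
```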
Let's close this in favor of #14508
Some very rare JVM errors are printed to stdout, and that confuses the code in spark-class. So add a check so that those cases are detected and the proper error message is shown to the user. Tested by running spark-submit after setting "ulimit -v 32000". Closes #14231 Author: Marcelo Vanzin <[email protected]> Closes #14508 from vanzin/SPARK-16586. (cherry picked from commit 1739e75) Signed-off-by: Marcelo Vanzin <[email protected]>
What changes were proposed in this pull request?
In the spark-class shell script, the exit code of the launcher is appended to the end of the launcher's output. However, when the launcher fails, there might not be a '\0' at the end of that output, which makes the test `[ $LAUNCHER_EXIT_CODE != 0 ]` abort with the error `[: too many arguments`. This patch fixes this bug by changing the way the exit code of the launcher is passed.
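A minimal reproduction of the failure mode (the JVM error string here is a hypothetical example):

```bash
# When the launcher dies early, the last '\0'-separated fragment can be a
# multi-word error message; the unquoted test then expands to many words:
LAUNCHER_EXIT_CODE='Error occurred during initialization of VM'
[ $LAUNCHER_EXIT_CODE != 0 ]   # bash: [: too many arguments
```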
How was this patch tested?
Manually tested.