
Conversation

@zasdfgbnm
Contributor

What changes were proposed in this pull request?

In the spark-class shell script, the exit code of the launcher is appended to the end of its output. However, when the launcher fails, there might not be a '\0' at the end of the launcher's output, which makes the test [ $LAUNCHER_EXIT_CODE != 0 ] abort with the error [: too many arguments.
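As a minimal repro of the failing test (using the JVM error message that appears later in this thread as a stand-in for the launcher's stray output):

$ LAUNCHER_EXIT_CODE='Error occurred during initialization of VM'
$ [ $LAUNCHER_EXIT_CODE != 0 ]
bash: [: too many arguments

The unquoted variable expands to multiple words, so [ sees too many arguments instead of a single exit code.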

This patch fixes the bug by changing the way the exit code of the launcher is passed.

How was this patch tested?

Manually tested.

Change the way the exit code of the launcher is handled so that there won't be a problem when the launcher fails.
@AmplabJenkins

Can one of the admins verify this patch?

# won't fail if some path of the user contains special characters such as '\n' or space
#
# Also note that when the launcher fails, it might not output something ending with '\0' [SPARK-16586]
_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" | xxd -p | tr -d '\n'; exit ${PIPESTATUS[0]})
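Presumably the rest of the patch decodes the hex back into the argument array along these lines (a sketch assuming the decode mirrors the encode; not necessarily the PR's exact code):

LAUNCHER_EXIT_CODE=$?   # status of the launcher, forwarded by 'exit ${PIPESTATUS[0]}' above
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(echo "$_CMD" | xxd -r -p)   # hex back to raw bytes, restoring the '\0' separators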
Member

is the problem that the JVM may not be able to allocate a 128m heap? I don't think that's supported at all; nothing will work. I don't think we want to complicate this much just to make a better error message, but if there's a particular exit status that indicates this that you can use to report a better message, OK. I wouldn't make the rest of this change.

Contributor Author

Yes, you are right. I found this problem when I tried to run Spark for testing on a login node of an HPC cluster, which limits the amount of memory applications can use. This is not a big problem, so it's OK to mark this as "won't fix", but the current behavior might be confusing to users who don't understand shell scripts. Actually this patch is a small change, and it reduces the total lines of code (though I agree this line is a little tricky and harder to understand). That's why I added some comments to explain what's happening.

Contributor Author

By the way, the previous code assumes that the output of the launcher ends with a '\0'; otherwise it will crash due to the same problem. I read the code in org.apache.spark.launcher, and it seems the launcher's main rarely prints output and then exits non-zero. So the only way this becomes a problem is an uncaught exception, or something going wrong in the JVM itself.
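For context, the relevant part of spark-class at the time looked roughly like this (paraphrased):

build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

If the launcher prints an error that doesn't end with '\0', that text and the appended exit code get glued into one final array element, which is what trips the unquoted test.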

Member

Is it easier to just check the exit status of this process with $? and exit with a better message if it's not 0?

Contributor Author

The biggest problem with achieving that: on success, the exit status of the process will be 0 and its output will be fields separated by '\0'. Bash doesn't allow storing '\0' in a variable, so we have to find some alternative way to store the output, and that extra step means $? is no longer the exit status of the launcher.
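Bash command substitution silently drops NUL bytes, which is easy to demonstrate:

$ VAR=$(printf 'a\0b')
$ printf '%s' "$VAR" | xxd
00000000: 6162                                     ab

The '\0' is simply gone (newer bash versions also print a warning about the ignored null byte), so the launcher's '\0'-separated output can't round-trip through a plain variable.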

Contributor Author

@zasdfgbnm zasdfgbnm Jul 27, 2016

Things could also be implemented like this:

# line 70-73
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  LAUNCHER_EXIT_CODE=$?
  if [ "$LAUNCHER_EXIT_CODE" != 0 ]; then printf "\0"; fi
  printf "%d\0" $LAUNCHER_EXIT_CODE
}

and

# line 83-87
CMD=("${CMD[@]:0:$LAST}")
if [ "$LAUNCHER_EXIT_CODE" != 0 ]; then
  echo "${CMD[@]}"
  exit $LAUNCHER_EXIT_CODE
fi

This would minimize the change to the original code.
If you prefer me to change the PR to do it this way, let me know and I'd be happy to do that, or you can fix this easy bug manually.

Member

Eh, I'm talking about something much simpler.

build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  if [ $? != 0 ]; then 
    echo "Couldn't run launcher"
    exit $?
  fi
  printf "%d\0" $?
}

Does that make sense?

Contributor Author

@zasdfgbnm zasdfgbnm Jul 28, 2016

This won't work: build_command is executed in a subshell, so the exit $? will only terminate the subshell.
You can try this in a bash shell:

build_command() { printf "aaaa\0"; exit 1; }
CMD=()
while IFS= read -d '' -r ARG; do CMD+=("$ARG"); done < <(build_command)

You will see that the shell doesn't exit, and CMD is set to aaaa.

Member

Aha. OK, next question: does that really need to be in a subshell? You mentioned it has something to do with not being able to store the null char in bash -- does that make it impossible to just store the output of this command directly in a variable?

Really I'd punt to @vanzin who will be a more informed reviewer.

Contributor Author

I can't find a neat way to deal with the \0...
Another option is to replace the \0 with \n and store the result in a variable, but that would be a problem if the user's Spark path contains a \n...
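A sketch of that rejected alternative, to show where it breaks (hypothetical code, not part of the PR):

# replace the '\0' separators with '\n' so the output fits in a plain variable
CMD_STR=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" | tr '\0' '\n')
CMD=()
while IFS= read -r ARG; do
  CMD+=("$ARG")
done <<< "$CMD_STR"
# breaks: an argument that itself contains a newline is split into two array entries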

@vanzin
Contributor

vanzin commented Jul 29, 2016

I think the change is a little confusing and doesn't really explain what's going on. I ran some tests ("ulimit -v" is your friend), and this patch covers what's wrong here.

diff --git a/bin/spark-class b/bin/spark-class
index 658e076..cd19e30 100755
--- a/bin/spark-class
+++ b/bin/spark-class
@@ -80,8 +80,17 @@ done < <(build_command "$@")
 COUNT=${#CMD[@]}
 LAST=$((COUNT - 1))
 LAUNCHER_EXIT_CODE=${CMD[$LAST]}
-if [ $LAUNCHER_EXIT_CODE != 0 ]; then
-  exit $LAUNCHER_EXIT_CODE
+
+# Certain JVM failures result in error being printed to stdout (instead of stderr), which causes
+# the code that parses the output of the launcher to get confused. In those cases, we check if
+# the exit code is an integer, and if it's not, handle it as a special error case.
+if [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
+  if [ "$LAUNCHER_EXIT_CODE" != 0 ]; then
+    exit "$LAUNCHER_EXIT_CODE"
+  fi
+else
+  echo "${CMD[@]}" | head -n-1 1>&2
+  exit 1
 fi

 CMD=("${CMD[@]:0:$LAST}")

Here's the error output:

$ ./bin/spark-submit 
Error occurred during initialization of VM
Unable to load native library: /apps/jdk1.7.0_80/jre/lib/amd64/libjava.so: failed to map segment from shared object

@zasdfgbnm
Contributor Author

Yes, this patch looks clearer.

vanzin pushed a commit to vanzin/spark that referenced this pull request Aug 5, 2016
Some very rare JVM errors are printed to stdout, and that confuses
the code in spark-class. So add a check so that those cases are
detected and the proper error message is shown to the user.

Tested by running spark-submit after setting "ulimit -v 32000".

Closes apache#14231
@srowen
Member

srowen commented Aug 5, 2016

Let's close this in favor of #14508

@zasdfgbnm zasdfgbnm closed this Aug 6, 2016
asfgit pushed a commit that referenced this pull request Aug 8, 2016
Some very rare JVM errors are printed to stdout, and that confuses
the code in spark-class. So add a check so that those cases are
detected and the proper error message is shown to the user.

Tested by running spark-submit after setting "ulimit -v 32000".

Closes #14231

Author: Marcelo Vanzin <[email protected]>

Closes #14508 from vanzin/SPARK-16586.

(cherry picked from commit 1739e75)
Signed-off-by: Marcelo Vanzin <[email protected]>