
Conversation

@ryan-williams
Contributor

e.g. OutOfMemoryError on the driver was leading to application
reporting SUCCESS on history server and to YARN RM.

@SparkQA

SparkQA commented Mar 23, 2015

Test build #28970 has started for PR 5130 at commit 5c31522.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 23, 2015

Test build #28970 has finished for PR 5130 at commit 5c31522.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28970/

@rxin
Contributor

rxin commented Mar 23, 2015

@tgravescs - can you take a look?

@rxin
Contributor

rxin commented Mar 23, 2015

@ryan-williams - can you add the jira ticket to the title, and add "YARN", e.g. "[SPARK-1234][YARN]"? Thanks.

@ryan-williams ryan-williams changed the title Report failure status if driver throws exception [SPARK-6449][YARN] Report failure status if driver throws exception Mar 23, 2015
@ryan-williams
Contributor Author

oh yeah, sorry, I forgot to do that @rxin

@zsxwing
Member

zsxwing commented Mar 23, 2015

If the driver throws an exception, that exception will be the cause of an InvocationTargetException. So you are logging the exception from the reflection API rather than the one from the driver code, right?
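A minimal sketch of the distinction (illustrative names, not the PR's code): the reflective call only surfaces the driver's throwable as `e.getCause`, so that is what should be logged.

```scala
import java.lang.reflect.{InvocationTargetException, Method}

object Target {
  def boom(): Unit = throw new RuntimeException("driver failed")
}

object Unwrap {
  // Invoke reflectively and return the real throwable from the invoked
  // code, not the InvocationTargetException wrapper the reflection API adds.
  def realFailure(m: Method, receiver: AnyRef): Throwable =
    try { m.invoke(receiver); null } catch {
      case e: InvocationTargetException => e.getCause
    }

  def main(args: Array[String]): Unit =
    println(realFailure(Target.getClass.getMethod("boom"), Target).getMessage)
}
```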

@sryza
Contributor

sryza commented Mar 23, 2015

As @zsxwing says, it appears that the code is already trying to handle this case. Do InvocationTargetExceptions only wrap Exceptions and not all Throwables? If that's the case, then the patch's approach seems to make sense. If they wrap Errors as well, then the fix would be to replace Exception with Throwable in the match block of the InvocationTargetException cause.

Also, how were we ending up with a success before? If anything forced us to break out of that try block, it seems like we wouldn't call finish with SUCCESS . Or does YARN just assume success in the case where we shut down without a report? (I can look this up if you don't know).

Last, what if we run into an OutOfMemoryError on a separate thread?
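The Exception-vs-Throwable question above can be sketched in isolation (illustrative names, not the patch's code): a match arm typed as Exception silently misses Errors such as OutOfMemoryError, while one typed as Throwable covers both.

```scala
object MatchDemo {
  // Matching only Exception misses Errors such as OutOfMemoryError.
  def reportedWithExceptionMatch(cause: Throwable): Boolean = cause match {
    case _: Exception => true
    case _            => false
  }

  // Matching Throwable covers both Exceptions and Errors.
  def reportedWithThrowableMatch(cause: Throwable): Boolean = cause match {
    case _: Throwable => true
  }
}
```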

@zsxwing
Member

zsxwing commented Mar 23, 2015

Do InvocationTargetExceptions only wrap Exceptions and not all Throwables?

It will wrap Error, too. I ran the following code on my machine:

import scala.collection.mutable.ArrayBuffer

class Foo {}

object Foo {

  def main(args: Array[String]): Unit = {
    // Grow an ArrayBuffer forever to force an OutOfMemoryError.
    val a = ArrayBuffer[String]()
    while (true) {
      a += "111111111111111111111111111111"
    }
  }
}

object Bar {

  def main(args: Array[String]): Unit = {
    // Invoke Foo.main reflectively, as the ApplicationMaster does for the driver.
    val mainMethod = classOf[Foo].getMethod("main", classOf[Array[String]])
    mainMethod.invoke(null, null)
  }

}

and it outputs,

Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at Bar$.main(Nio.scala:72)
    at Bar.main(Nio.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
    at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
    at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
    at Foo$.main(Nio.scala:62)
    at Foo.main(Nio.scala)
    ... 11 more

@zsxwing
Member

zsxwing commented Mar 23, 2015

If they wrap Errors as well, then the fix would be to replace Exception with Throwable in the match block of the InvocationTargetException cause.

This has been fixed in #4773

@zsxwing
Member

zsxwing commented Mar 23, 2015

Also, how were we ending up with a success before? If anything forced us to break out of that try block, it seems like we wouldn't call finish with SUCCESS . Or does YARN just assume success in the case where we shut down without a report? (I can look this up if you don't know).

After #4773, it should not end up with a success for OutOfMemoryError.

However, in my experience, if AMRMClient.unregisterApplicationMaster has not been called, YARN will restart the AM until the maximum number of attempts is exceeded. So if the user never creates a SparkContext and the driver code exits normally, YARN will still restart the AM. We have an ugly fix that forces the AM to register and unregister with YARN if it finds the AM has not registered when exiting.

@zsxwing
Member

zsxwing commented Mar 23, 2015

Last, what if we run into an OutOfMemoryError on a separate thread?

Since the AM does not set SparkUncaughtExceptionHandler, the OutOfMemoryError will just be printed to stderr if the driver does not set an UncaughtExceptionHandler for that thread. I think it may hang Spark's other threads. See #5004 for an example.
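The separate-thread case can be sketched as follows: without a handler, the Error only reaches stderr, while installing one (as SparkUncaughtExceptionHandler does elsewhere in Spark) lets the process actually observe the failure. Names here are illustrative, not Spark's code.

```scala
object HandlerDemo {
  // Throw an Error on a background thread and observe it via an
  // UncaughtExceptionHandler; join() makes the write visible afterwards.
  def run(): Throwable = {
    var caught: Throwable = null
    val t = new Thread(new Runnable {
      def run(): Unit = throw new OutOfMemoryError("demo")
    })
    t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      def uncaughtException(th: Thread, e: Throwable): Unit = caught = e
    })
    t.start()
    t.join()
    caught
  }

  def main(args: Array[String]): Unit =
    println(run().getClass.getName) // java.lang.OutOfMemoryError
}
```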

@tgravescs
Contributor

@ryan-williams Are you seeing this exception with Spark 1.3, or with an older version? (i.e., #4773 didn't fix this particular issue)

@tgravescs
Contributor

However, in my experience, if AMRMClient.unregisterApplicationMaster has not been called, Yarn will restart the AM until exceeding the max attempts. So if the user does not create a SparkContext and the driver code exits normally, Yarn still will restart the AM. We have an ugly fix that forcing AM to register and unregister with Yarn if finding AM does not register when exiting.

@zsxwing Can you clarify this? Are you running something that never starts a SparkContext? I'm not sure what you mean by the user not creating a SparkContext but the driver exiting normally.

@zsxwing
Member

zsxwing commented Mar 23, 2015

@zsxwing Can you clarify this? Are you running something that never starts SparkContext? I'm not sure what you mean by the user doesn't create spark context but the driver exits normally.

E.g., my application may check some folders at first. If they exist, it will create SparkContext and run some jobs. If not, it just exits because the data is not ready.
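That pattern can be sketched roughly as below; the input path and the commented-out job step are hypothetical placeholders, not code from any Spark application.

```scala
import java.nio.file.{Files, Paths}

object EarlyExitDriver {
  // Check whether the input data exists before creating a SparkContext.
  def dataReady(dir: String): Boolean = Files.exists(Paths.get(dir))

  def main(args: Array[String]): Unit = {
    if (!dataReady(args.headOption.getOrElse("/data/input"))) {
      // Exit normally without ever creating a SparkContext; as discussed
      // above, YARN may then restart the AM because it never registered.
      sys.exit(0)
    }
    // ... create SparkContext and run jobs here ...
  }
}
```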

@tgravescs
Contributor

That case is basically not handled right now. We expect creating the SparkContext to be one of the first things the application does, which is why the AM waits for the SparkContext to be initialized. Anything you do in your program before that initialization relies on the fact that we wait a certain period for it, and if you never create it, we consider that a failure. This seems like something a workflow manager should be doing, but if you want to handle that case I suggest filing a separate JIRA.

val sc = waitForSparkContextInitialized()

// If there is no SparkContext at this point, just fail the app.
if (sc == null) {
  finish(FinalApplicationStatus.FAILED,
    ApplicationMaster.EXIT_SC_NOT_INITED,
    "Timed out waiting for SparkContext.")
}
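The wait-and-fail pattern in that snippet boils down to a bounded wait on an initialization signal. A minimal standalone sketch using a CountDownLatch (not Spark's actual implementation):

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

object WaitForInit {
  // Block for at most timeoutMs; true if initialization was signaled in time,
  // false on timeout (the case where the AM fails the application).
  def waitForInit(initialized: CountDownLatch, timeoutMs: Long): Boolean =
    initialized.await(timeoutMs, TimeUnit.MILLISECONDS)

  def main(args: Array[String]): Unit = {
    val latch = new CountDownLatch(1)
    println(waitForInit(latch, 50)) // false: nothing signaled in time
    latch.countDown()
    println(waitForInit(latch, 50)) // true: already signaled
  }
}
```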

@zsxwing
Member

zsxwing commented Mar 23, 2015

@tgravescs Thanks for the clarification

@ryan-williams
Contributor Author

It seems like I was running a pre-#4773 Spark. I just ran a job from v1.3.0 that OOMs and it correctly reported FAILED.

For good measure I'm building+running again from right before #4773, but in the meantime I'll close this, thanks all.

