
Conversation

@HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Dec 8, 2018

What changes were proposed in this pull request?

This patch proposes adding a new configuration to the SHS: a custom executor log URL pattern. It lets end users point executor log links at something other than what the resource manager provides, such as an external log service, so that executor logs can still be served when a NodeManager becomes unavailable (in the case of YARN).

End users can build their own custom executor log URLs from pre-defined patterns, which vary by resource manager. This patch adds patterns for the YARN resource manager. (For the other resource managers there is no executor log URL available at all, so no patterns can be defined for them.)

Please refer to the doc change as well as the added UTs in this patch to see how to set up the feature.
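
For illustration, a minimal sketch of what such a configuration could look like (the property name and {{...}} pattern variables follow this patch's YARN support; the log-service host and path are made-up examples):

// Illustrative only: pointing executor log links at an external log service.
// The host and path below are examples, not a real endpoint.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.history.custom.executor.log.url",
    "http://logserver.example.com/clusters/{{CLUSTER_ID}}/containers/{{CONTAINER_ID}}/" +
      "users/{{USER}}/files/{{FILE_NAME}}")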

How was this patch tested?

Added UT, as well as manual test with YARN cluster

@SparkQA

SparkQA commented Dec 8, 2018

Test build #99857 has finished for PR 23260 at commit 65cc6a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Dec 9, 2018

If you're on YARN, this feels like something you would manage via YARN and its cluster management options. Is there a specific use case here, that this has to happen in Spark?

@HeartSaVioR
Contributor Author

HeartSaVioR commented Dec 9, 2018

@srowen
For now the executor log URL is static in Spark, which forces the NodeManager to stay alive even after the application is finished, in order to serve executor logs in the SHS.

This situation can arise when decommissioning happens fairly frequently, or when end users want some elasticity in the YARN cluster (not only decommissioning nodes, but elasticity of the YARN cluster itself - YARN has a cluster ID for the RM that identifies the cluster, which can be leveraged when dealing with multiple YARN clusters).

There's also a similar change applied on the Hadoop side:
apache/hadoop@5fe1dbf

We are experimenting with a central log service that resolves the situation above. Since the log URL for a centralized log service can't be the same URL as the NM webapp, we need flexibility in the executor log URL.

Hope this explains the rationale well.

@srowen
Member

srowen commented Dec 10, 2018

Ok, got it. @vanzin or @squito or others would be better able to evaluate.

Contributor

@vanzin vanzin left a comment

This should have unit tests.

Contributor

@vanzin vanzin Dec 10, 2018

Is this the full address of the NM HTTP server? Because it feels a little conflicting with the above variable.

Also I'm not sure what "node on container" means. Perhaps "node where container was run".

Contributor Author

My bad. It is host:port rather than a URI, and it can be retrieved from container.getNodeHttpAddress. The phrase "node on container" is borrowed from the javadoc of that method, but I'm OK with anything clearer.

Will address.

Contributor

Could this just be converted to the default value of the log configuration? Seems like all the variables here match the ones you're using there.

Contributor Author

Yes it will remove the branch. Will address.

Contributor

These constants are only used in the methods below. Also, the methods below are only called from a single place.

Seems to me you should have a single method that implements all this logic. You could also avoid this new object, for the same reasons.

Contributor Author

@HeartSaVioR HeartSaVioR Dec 11, 2018

Ah OK. I'm in favor of avoiding using string constants directly, but I don't have a strong opinion on it. Will address.

And yes, I can put them in a single method, but placing a new method on the class would put an unnecessary burden on the test code, since ExecutorRunnable takes lots of parameters to instantiate.

If we want to add an end-to-end test (instantiating a YARN cluster and running executors) we still need to instantiate ExecutorRunnable (I think we already cover that here [1]), but if we just want to make sure the logic works properly, we might want to keep this as a new object and add a test against the object, avoiding instantiating ExecutorRunnable (a rough sketch of that shape follows the quoted test below). WDYT?

  1. // If we are running in yarn-cluster mode, verify that driver logs links and present and are
     // in the expected format.
     if (conf.get("spark.submit.deployMode") == "cluster") {
       assert(listener.driverLogs.nonEmpty)
       val driverLogs = listener.driverLogs.get
       assert(driverLogs.size === 2)
       assert(driverLogs.contains("stderr"))
       assert(driverLogs.contains("stdout"))
       val urlStr = driverLogs("stderr")
       driverLogs.foreach { kv =>
         val log = Source.fromURL(kv._2).mkString
         assert(
           !log.contains(SECRET_PASSWORD),
           s"Driver logs contain sensitive info (${SECRET_PASSWORD}): \n${log} "
         )
       }
       val containerId = YarnSparkHadoopUtil.getContainerId
       val user = Utils.getCurrentUserName()
       assert(urlStr.endsWith(s"/node/containerlogs/$containerId/$user/stderr?start=-4096"))
     }
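
A rough sketch of the standalone-object shape being discussed; the object name, parameters, and URL layout here are assumptions for illustration, not the names in the patch:

object ExecutorLogUrlBuilder {
  // Build stdout/stderr log URLs from container info, so the logic can be unit-tested
  // without constructing ExecutorRunnable. Mirrors the NM URL layout in the test above.
  def buildLogUrls(
      httpScheme: String,
      nodeHttpAddress: String,
      containerId: String,
      user: String): Map[String, String] = {
    val baseUrl = s"$httpScheme$nodeHttpAddress/node/containerlogs/$containerId/$user"
    Map(
      "stdout" -> s"$baseUrl/stdout?start=-4096",
      "stderr" -> s"$baseUrl/stderr?start=-4096")
  }
}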

Contributor

This is not the Spark style for multi-line method declarations.

Contributor Author

I guess it's allowed when it fits within two lines, but no problem changing it. Will address.

Member

@HyukjinKwon HyukjinKwon Dec 11, 2018

Strictly speaking, the guideline itself does not explicitly mention this - the linked comment says it's okay if it fits within 2 lines (see databricks/scala-style-guide#64 (comment)). I intentionally avoided this style because some code in some components does not comply with it.

However, strictly speaking we had better stick to two-space indentation whenever possible, per https://github.com/databricks/scala-style-guide#indent:

Use 2-space indentation in general.

@SparkQA

SparkQA commented Dec 11, 2018

Test build #99955 has finished for PR 23260 at commit dbeade7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

@vanzin Thanks for the detailed review! Addressed review comments.

Contributor

@squito squito left a comment

I'm not sure I understand the point of this. The yarn change you mentioned, apache/hadoop@5fe1dbf / https://issues.apache.org/jira/browse/YARN-8964, only adds support for an optional clusterId query param, which seems different from this.

To make sure I understand correctly: by itself this just allows you to set a different URL where the logs will be available -- but it doesn't do anything to actually make those logs available anywhere else. I guess you have other changes in your setup to have those logs written somewhere else in the first place? I assume for your setup, you are not using {{NodeHttpAddress}} and are replacing it with something else centralized?

Contributor

4-space indent for method parameters

Contributor Author

Ah yes, I just confused the indent rule for method parameters with the one for return types... Nice catch. Will address.

Contributor

nit: double-indent (4 spaces) the continuation line

Contributor Author

Will address.

@vanzin
Contributor

vanzin commented Dec 11, 2018

I'm not sure I understand the point of this

My understanding is that this allows pointing the Spark UI directly at the history server (old JHS or new ATS) instead of hardcoding the NM URL and relying on the NM redirecting you, since the NM may not exist later on.

That does open up some questions, though. The code being modified is in the AM, which means the user needs to opt into this when submitting the app; if there were a way to hook this up on the Spark history server side only, that might be more useful.

I think someone tried that in the past but the SHS change was very YARN-specific, which made it kinda sub-optimal.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Dec 11, 2018

@squito @vanzin

I assume for your setup, you are not using {{NodeHttpAddress}} and are replacing it with something else centralized?

Yes, exactly.

My understanding is that this allows pointing the Spark UI directly at the history server (old JHS or new ATS) instead of hardcoding the NM URL and relying on the NM redirecting you, since the NM may not exist later on.

Yes, exactly. That's one of the issues this patch enables us to deal with; another is cluster awareness. The existence of a clusterId for the RM suggests that YARN is opening up the possibility of maintaining multiple YARN clusters and providing centralized services that operate across multiple clusters.

when perhaps if there was a way to hook this up on the Spark history server side only, that may be more useful.
I think someone tried that in the past but the SHS change was very YARN-specific, which made it kinda sub-optimal.

I agree the case is really about finished applications rather than running ones. Currently Spark just sets executor log URLs in the environment on the resource manager side and uses them. The usages are broad, and I'm not sure we can determine, in all of those usages, which resource manager the application is running on or whether the application is running or finished. (I'm not familiar with the UI side.) So this patch tackles it in the easiest way.

@SparkQA

SparkQA commented Dec 11, 2018

Test build #99992 has finished for PR 23260 at commit 4c865fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Dec 11, 2018

This is the other PR I was referring to: #20326

@HeartSaVioR
Contributor Author

@vanzin
IMHO we can't take the approach #20326 took, since the cluster ID is not available in the default log URL and so cannot be retrieved from it. Also, relying on a pattern (container_) in the URL feels fragile to me: while #20326 only extracted the container ID from the URL, we need to get some other parameters as well, which opens up a bigger possibility of breakage.

@squito
Contributor

squito commented Dec 12, 2018

I'm sorry, I still don't totally follow, partially from my ignorance of some details in yarn here which I'd like to better understand. Marcelo's comment sounds like a good use of this:

My understanding is that this allows pointing the Spark UI directly at the history server (old JHS or new ATS) instead of hardcoding the NM URL and relying on the NM redirecting you, since the NM may not exist later on.

So this change would allow the user to do that, if they change these parameters when they submit their application? It sounds like with this patch, if you configure things correctly, you can get the JHS / ATS to display those logs (I guess this works using yarn's log aggregation?). If so, it might make sense to include a description of how you'd configure things that way in the docs (it isn't obvious to me how to do it, anyway).

Or maybe I'm still not quite following, and there is some 3rd party piece here, outside of spark & yarn, which collects the logs and can serve them later on, whether or not the NM can serve the logs?

@vanzin
Contributor

vanzin commented Dec 12, 2018

if they change these parameters when they submit their application?

Right, and that's the part that worries me. Both because the user has to do that (well, the admin could put the values in the default Spark properties file), and also because I'm not sure about what's the behavior while the app is running. If you go to the live UI, and click on the log link, where does that take you?

@HeartSaVioR
Contributor Author

HeartSaVioR commented Dec 12, 2018

@squito

Or maybe I'm still not quite following, and there is some 3rd party piece here, outside of spark & yarn, which collects the logs and can serve them later on, whether or not the NM can serve the logs?

Exactly, and collecting logs can also happen while the app is running.

For now I would say there's a 3rd party here, but the Hadoop side is trying to leverage clusterId, which suggests Hadoop is also moving toward multi-cluster awareness, so it's not impossible for Hadoop/YARN to include multi-cluster-aware centralized services in the future.

@vanzin

because the user has to do that (well, the admin could put the values in the default Spark properties file)

In practice, an admin will put the value in the Spark properties. I agree it doesn't sound good that end users can override it, but I'm not sure Spark can prevent that. Please let me know if there's a way for Spark to read the value only from the Spark properties file and not allow end users to override it at submit time; I'm not aware of one, but I'll use it once it exists.

I'm not sure about what's the behavior while the app is running. If you go to the live UI, and click on the log link, where does that take you?

Centralized log services (whichever exist) will provide the logs at unique URLs, and Spark will always point to those URLs.

Suppose the log service knows the status of the NM and the application; then the service can do anything we serve now. If the NM is live, the service could redirect/forward to the NM's log URL, or just serve a stored log file that is continuously pulled from the NM. (The latter may show a slightly outdated log, but that just depends on when the service pulls, which is a detail of the log service, not something Spark should worry about.) If the NM is not live, it will serve the stored log file pulled before the NM went offline.

In any case, I hope we don't end up stuck with a static URL, and instead provide some flexibility for 3rd parties and end users.

@squito
Contributor

squito commented Dec 12, 2018

OK, so while this might be useful with the JHS / ATS, @HeartSaVioR 's real use case is with some external log management system, and this works both for live apps & after the app is complete, even if the NM is gone.

Is this log aggregation system something publicly available, or something private? I'm a bit reluctant to add these configs with such limited applicability.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Dec 12, 2018

It's something private for now. Btw, this feels like a chicken-and-egg problem. Once we open up the flexibility, 3rd parties can leverage it; once we close down the possibility, 3rd parties won't even give it a try. For example, Spark simply opened up the possibility of leveraging Dropwizard metrics, and now there are many metrics sinks supported directly in Spark or provided by 3rd parties.

@squito
Contributor

squito commented Dec 13, 2018

yes, I see your point about the chicken and egg. I also wonder if this feature should not be so yarn-specific then -- in fact, it almost seems more important on kubernetes, as there is no long-lived NM there. But maybe the params you need end up being specific to the deployment mode (eg. {{ContainerId}}), so there is no general solution.

I'm inclined to wait on this a while, until we see if there is a way to get this to work more generally, or maybe even to work with yarn while the app is running; but I don't feel strongly enough that I'm blocking it, either. @vanzin do you have more thoughts?

@HeartSaVioR
Contributor Author

HeartSaVioR commented Dec 14, 2018

@squito

I also wonder if this feature should not be so yarn-specific then -- in fact, it almost seems more important on kubernetes, as there is no long-lived NM there.

100% agreed, and I imagine Hadoop is heading the same way (being containerized, configured in the cloud, and supporting elasticity), which is what brings this need to the YARN side.

But maybe the params you need end up being specific to deployment mode, (eg. {{ContainerId}}) so there is no general solution.

Agreed. The parameters should be specific to the deployment mode, given that the concepts, as well as the terms/components, can differ between deployment modes.

maybe even to work with yarn even while the app is running;

I guess I got the same question from @vanzin and answered it. Could you elaborate if my answer doesn't address what you're considering?

@squito
Contributor

squito commented Dec 14, 2018

well, I thought you said above that this does not integrate that well with JHS / ATS currently, because we're not sure what will happen with running jobs. You only answered about running applications wrt your own private log service.

If this does work for live applications as well as completed applications, then I am in favor of this change.

@HeartSaVioR
Contributor Author

@squito
If we don't touch the default value, things just run as they do now, because the default URL is the static NM URL Spark has been using. So this doesn't break anything Spark currently has.

This patch provides flexibility in the executor log URL, covering not only our case but any case where the log URL can be expressed via the provided patterns. If it turns out the set of patterns is not enough in the future, we can simply add the missing patterns whenever the YARN runner can provide them. (We can't predict and enumerate everything.) If there's a case where the YARN runner cannot provide a pattern that some service needs in order to serve logs, there's no way for Spark to determine such a URL and set it as the log URL anyway.

Regarding running vs. finished applications, even the NM provides a unique URL and redirects when the application is finished. I don't see why other services couldn't do the same, but if we have a concrete case where running and finished applications should have different log URLs (which would mean we may need to change the log URL only for finished applications), I can take a look and address it.

Does it make sense to you?

@vanzin
Contributor

vanzin commented Dec 17, 2018

I agree with Imran that it would be best to think about how to properly support this regardless of resource manager; including allowing applications and SHS to have different URL templates. Otherwise you end up just hardcoding a different URL; if the log server URL needs to change, all your previous applications will have broken links.

e.g. the application could save the parameters somehow in the event log, and the SHS could use those when replacing things in its own value for the log URL.

The parameters don't need to be pre-defined, each RM can have its own set, and it's up to the admin to set things up so that the URLs in the SHS make sense for their deployment.

@HeartSaVioR
Contributor Author

OK. I guess I see what we're concerned about. It looks like we don't want to set a specific URL in the env that could ever become a permanently dead link (the webserver moves, we change the URL pattern, etc.).

I feel this concern has been sitting in Spark for years (imagine if the format of the YARN URL had had to change in a prior version), so I think we have time to address it incrementally (doing better than the current state), but I also think we can investigate what it would take to achieve that and just do it all at once if it doesn't require huge effort.

The parameters don't need to be pre-defined, each RM can have its own set, and it's up to the admin to set things up so that the URLs in the SHS make sense for their deployment.

I'm not sure I follow. Some parameters can be retrieved from the YARN config, but there are other parameters we get from the allocated container (and maybe other things we can't get from the YARN config). Could you please elaborate?

@vanzin
Contributor

vanzin commented Dec 17, 2018

While some parameters can be retrieved from YARN config but there're other parameters we get from allocated container

Yes, and that's part of what needs to be solved when thinking about a generic solution. I'm not giving you a solution, I'm saying what I would expect from a solution. How to achieve that is a different discussion.

Contributor

Can we do Utils.tryLog(YarnConfiguration.getClusterId(conf)).toOption

Contributor Author

Nice suggestion. Sorry I missed this while dealing with other stuff. Will address.

Contributor Author

@HeartSaVioR HeartSaVioR Jan 16, 2019

I took a look at Utils.tryLog and it doesn't seem to fit my intention, since it logs an error message that could give end users the wrong signal (and it also catches exceptions too broadly). If the RM cluster id is a mandatory config, we may need a different approach (like failing fast). If it is optional, I think we should not leave an error log.
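
A minimal sketch of the narrower alternative being described, assuming the intent is simply to map a missing cluster id to None without logging:

// Sketch: treat a missing yarn.resourcemanager.cluster-id as None instead of logging an
// error via Utils.tryLog; YarnConfiguration.getClusterId throws when the id is not set.
import scala.util.Try
import org.apache.hadoop.yarn.conf.YarnConfiguration

def getClusterId(yarnConf: YarnConfiguration): Option[String] =
  Try(YarnConfiguration.getClusterId(yarnConf)).toOption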

@HeartSaVioR
Contributor Author

I'm OK with thinking about a general solution / different URLs between the UI in the driver and the SHS, except for the parameters.

I have been thinking about how the parameters could be general / flexible, but I still don't see it.

  1. Generalization across resource managers doesn't look possible, due to differences in terms, concepts, provided information, etc.
  2. Even for the same resource manager, I'm not sure how flexible we want to be. (That's the main reason I asked you to elaborate.) More flexibility requires storing possibly unnecessary things (which might even include credentials) in the event log for every app. Without seeing a definition of the specification and what it enables, I'm a bit worried that it could just become a high bar/wall.

@vanzin
Contributor

vanzin commented Jan 2, 2019

A general solution requires two things:

  • the application to save some information in the event log so that the SHS can use it
  • the SHS to use that information to build the log URLs when building an application's UI later

The mechanism to do that doesn't need to be RM-specific. The data (i.e. the information saved in the event logs, the log URL template itself) can be, but the mechanism doesn't need to.

If you think about the code you have, you already defined a set of parameters for the log URL. All you need is, instead of parsing the log URL on the application side, do that on the SHS side, based on data written by the app. You can even only implement that for YARN, but you'd have a generic mechanism that later can also be implemented for other RMs without having to change the SHS.
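
To make the shape of that mechanism concrete, here is a hedged sketch (names and pattern syntax are illustrative only): the application records a map of attributes in the event log, and the SHS later substitutes them into an admin-provided URL template.

// Sketch: fill {{ATTR}} placeholders in an admin-provided template with attributes the
// application wrote to the event log. Attribute names and template syntax are assumptions.
def buildLogUrl(template: String, attributes: Map[String, String]): String =
  attributes.foldLeft(template) { case (url, (key, value)) =>
    url.replace(s"{{$key}}", value)
  }

// e.g. buildLogUrl("http://logserver.example.com/{{CLUSTER_ID}}/{{CONTAINER_ID}}/{{FILE_NAME}}",
//   Map("CLUSTER_ID" -> "prod", "CONTAINER_ID" -> "container_1234_0001_01_000002", "FILE_NAME" -> "stderr"))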

@SparkQA

SparkQA commented Jan 4, 2019

Test build #100716 has finished for PR 23260 at commit 4c865fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

I'm afraid I'm over-thinking this, but it still doesn't feel easy to tackle: it's not just moving logic to the SHS, but also defining some interfaces and taking care of running the implementation.

the application to save some information in the event log so that the SHS can use it

I think you mean exposing an interface for end users to define a set of parameters for the log URL, and having them stored in the event log when launching an executor on the resource manager. Do I understand correctly?

Specifically for YARN, the implementation would have to run in ExecutorRunnable (in the AM process), which has all the information; this means we may need to care about the classpath for the implementation.

I would also like to be clear about "application": are you suggesting this as part of the Spark application (so end users would have it in their app jar) or as an individual (global) plugin (so the jar would be placed or linked in the Spark classpath)? I think a Spark application should run on any resource manager, and this could open the possibility of tying the application to a specific resource manager.

Btw, the SPARK_LOG_URL_ prefix is only used by the standalone and YARN resource managers; I'm not sure how Mesos and K8s provide executors' log URLs. I may need to spend some time getting familiar with the other resource managers as well.
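
For reference, a sketch of how those SPARK_LOG_URL_-prefixed env vars typically become the executor's log-URL map; the helper name and shape here are assumptions for illustration:

// Sketch: collect SPARK_LOG_URL_-prefixed environment variables into a logUrls map,
// e.g. SPARK_LOG_URL_STDERR -> key "stderr", as the standalone and YARN backends do today.
def extractLogUrls(env: Map[String, String]): Map[String, String] = {
  val prefix = "SPARK_LOG_URL_"
  env.filter { case (k, _) => k.startsWith(prefix) }
    .map { case (k, v) => (k.substring(prefix.length).toLowerCase, v) }
}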

@HeartSaVioR
Contributor Author

HeartSaVioR commented Jan 29, 2019

I'm seeing flakiness in the YARN tests after rebasing, specifically after rebasing against SPARK-22404 (#19616). There are more failures than successes. I'm triggering the build multiple times, as well as running the tests in local dev. We need to read the container logs to track down the issue, but I couldn't do that against the Jenkins build, so I have to work in the local env.

@vanzin
Contributor

vanzin commented Jan 29, 2019

I ran the tests a few times locally without your changes and they seem fine. Also couldn't find any failures on jenkins.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Jan 29, 2019

Looks like in unmanaged mode some attributes are not available.

19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver: 19/01/29 10:55:03 INFO YarnContainerInfoHelper: Error while building executor logs - executor logs will not be available
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver: java.lang.NumberFormatException: null
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at java.lang.Integer.parseInt(Integer.java:542)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at java.lang.Integer.parseInt(Integer.java:615)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at scala.collection.immutable.StringLike.toInt(StringLike.scala:304)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at scala.collection.immutable.StringLike.toInt$(StringLike.scala:304)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at scala.collection.immutable.StringOps.toInt(StringOps.scala:33)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.util.YarnContainerInfoHelper$.getNodeManagerHttpPort(YarnContainerInfoHelper.scala:105)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.util.YarnContainerInfoHelper$.getLogUrls(YarnContainerInfoHelper.scala:41)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.deploy.yarn.ExecutorRunnable.$anonfun$prepareEnvironment$4(ExecutorRunnable.scala:240)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.deploy.yarn.ExecutorRunnable.$anonfun$prepareEnvironment$4$adapted(ExecutorRunnable.scala:239)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at scala.Option.foreach(Option.scala:274)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.deploy.yarn.ExecutorRunnable.prepareEnvironment(ExecutorRunnable.scala:239)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:89)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:66)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.deploy.yarn.YarnAllocator$$anon$1.run(YarnAllocator.scala:532)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver:   at java.lang.Thread.run(Thread.java:748)
19/01/29 10:55:03.993 launcher-proc-3 INFO YarnClusterDriver: 19/01/29 10:55:03 INFO YarnContainerInfoHelper: Error while retrieving executor attributes - executor logs will not be replaced with custom log pattern
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver: java.lang.NumberFormatException: null
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at java.lang.Integer.parseInt(Integer.java:542)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at java.lang.Integer.parseInt(Integer.java:615)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at scala.collection.immutable.StringLike.toInt(StringLike.scala:304)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at scala.collection.immutable.StringLike.toInt$(StringLike.scala:304)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at scala.collection.immutable.StringOps.toInt(StringOps.scala:33)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.util.YarnContainerInfoHelper$.getNodeManagerPort(YarnContainerInfoHelper.scala:103)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.util.YarnContainerInfoHelper$.getAttributes(YarnContainerInfoHelper.scala:64)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.deploy.yarn.ExecutorRunnable.$anonfun$prepareEnvironment$4(ExecutorRunnable.scala:246)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.deploy.yarn.ExecutorRunnable.$anonfun$prepareEnvironment$4$adapted(ExecutorRunnable.scala:239)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at scala.Option.foreach(Option.scala:274)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.deploy.yarn.ExecutorRunnable.prepareEnvironment(ExecutorRunnable.scala:239)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:89)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:66)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at org.apache.spark.deploy.yarn.YarnAllocator$$anon$1.run(YarnAllocator.scala:532)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
19/01/29 10:55:03.994 launcher-proc-3 INFO YarnClusterDriver:   at java.lang.Thread.run(Thread.java:748)

First one:

org.apache.spark.util.YarnContainerInfoHelper$.getNodeManagerHttpPort(YarnContainerInfoHelper.scala:105)

Second one:

at org.apache.spark.util.YarnContainerInfoHelper$.getNodeManagerPort(YarnContainerInfoHelper.scala:103)

which makes the attribute Map empty and fails the test. Please note that it sometimes works; it doesn't always fail.

Is it expected behavior for unmanaged mode not to have the NM HTTP port (as well as the NM port) in the system env? If so, we can change the test to not check attributes. If not, we need to investigate.
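
A small hedged sketch of the defensive parsing that would avoid the NumberFormatException when NM_HTTP_PORT is absent from the environment (the helper shape is an assumption, not the patch's final code):

// Sketch: parse the NM HTTP port defensively, yielding None when the env var is unset
// (as appears to happen for an unmanaged AM) instead of throwing NumberFormatException.
import scala.util.Try
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

def getNodeManagerHttpPort(env: Map[String, String] = sys.env): Option[Int] =
  env.get(Environment.NM_HTTP_PORT.name()).flatMap(p => Try(p.toInt).toOption)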

@HeartSaVioR
Contributor Author

HeartSaVioR commented Jan 29, 2019

And very oddly, it looks like unmanaged mode runs in "yarn-client" mode, but the test only verifies the map when the deploy mode is "cluster".

Never mind; I mis-traced the code. The executor logs don't appear to be present. I'll find the reason.

…ntly

* they're only available in container's env - cannot retrieve them outside of container process
@SparkQA

SparkQA commented Jan 29, 2019

Test build #101782 has finished for PR 23260 at commit 98b7b16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 29, 2019

Test build #101781 has finished for PR 23260 at commit 0df31ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

I just fixed a bug: the reason is that we cannot retrieve NM_HOST / NM_PORT / NM_HTTP_PORT outside of the container process for a given container. In YARN cluster mode we can get them for the driver, but not for executors in either cluster or client mode.

In order to roll these attributes back to NM_HTTP_ADDRESS, I had to remove the explanation of setting the custom log URL to the JHS URL, because it refers to NM_PORT (IPC), which cannot be retrieved.

We might try moving the extraction of log URLs and attributes to the YARN executor side and see whether that works, but I'd prefer to address that in a separate issue. I would like to get this into a stable shape first, and experiment more afterwards.

@HeartSaVioR
Contributor Author

retest this, please

1 similar comment
@HeartSaVioR
Contributor Author

retest this, please

@HeartSaVioR
Contributor Author

Commit trigger and two manual requests: 3 builds. Let's see how it goes. I checked that YarnClusterSuite passed in local dev 3 times in a row.

@SparkQA

SparkQA commented Jan 29, 2019

Test build #101796 has finished for PR 23260 at commit 2d04802.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 29, 2019

Test build #101797 has finished for PR 23260 at commit 2d04802.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

The two build failures don't look related to the YARN tests. Will retrigger tests...

@HeartSaVioR
Contributor Author

Retest this, please.

1 similar comment
@HeartSaVioR
Contributor Author

Retest this, please.

@SparkQA

SparkQA commented Jan 29, 2019

Test build #101802 has finished for PR 23260 at commit 2d04802.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


def getNodeManagerHttpAddress(container: Option[Container]): String = container match {
  case Some(c) => c.getNodeHttpAddress
  case None => System.getenv(Environment.NM_HOST.name()) + ":" +
Contributor

So, your comments about the unmanaged AM reminded me of something else.

This code is always running in the same container (the AM, managed or not). So even in the managed case, this is not correct: you'd be returning the host:port for the NM running the AM, not the one that will be running the container.

Seems like we have to figure out how to have the executor build and return this data. This would solve the above problem, and also properly work with the unmanaged AM.

Contributor Author

@HeartSaVioR HeartSaVioR Jan 29, 2019

None is only used in cluster mode, to expose the driver's log URLs as well as its attributes from within its container. For executors we always pass the container we were assigned, and client mode is not affected by this code; if my understanding is right, the unmanaged AM falls into that case (client mode), so we don't need to worry about it.

This takes the same path as before, just refactored. The fix is correct; in other words, before the last fix it was incorrect, and my previous comment on the observed test failure was also wrong. I might not have a deep enough understanding of how YARN mode works. Sorry about that.

Contributor Author

@HeartSaVioR HeartSaVioR Jan 29, 2019

Seems like we have to figure out how to have the executor build and return this data.

I'm working on the patch; the commit below passes the tests.
HeartSaVioR@3c09646

I guess we'll need another round of review on the new change once I apply that commit here. If we need to go through multiple phases of review for the new commit, could we do it in a new JIRA issue/PR instead? We have already put a lot of effort (review, applying feedback, new requirements) into this PR, and I'd like to get it done sooner.

Contributor Author

Just FYI: #23706 addressed this requirement, so you may find it interesting.

@vanzin
Contributor

vanzin commented Jan 30, 2019

Ok, we can deal with unmanaged client mode separately. Took another look and it looks good.

Merging to master.

@asfgit asfgit closed this in ae5b2a6 Jan 30, 2019
@HeartSaVioR
Contributor Author

Thanks all for the detailed reviews and for merging! I'll file a new issue and raise a PR soon for having the executor retrieve this information itself.

@HeartSaVioR HeartSaVioR deleted the SPARK-26311 branch January 30, 2019 21:09
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…cutor log URLs in SHS

Closes apache#23260 from HeartSaVioR/SPARK-26311.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>