[SPARK-26311][CORE] New feature: apply custom log URL pattern for executor log URLs in SHS #23260
Conversation
|
Test build #99857 has finished for PR 23260 at commit
|
|
If you're on YARN, this feels like something you would manage via YARN and its cluster management options. Is there a specific use case here, that this has to happen in Spark? |
|
@srowen This situation can happen when decommissioning happens fairly frequently, or when end users want some elasticity for the YARN cluster (not only decommissioning nodes, but also elasticity of the YARN cluster itself - YARN has a cluster id for the RM which identifies the cluster, and that can be leveraged when dealing with multiple YARN clusters). There's also a similar change applied on the Hadoop side. We are experimenting with a central log service which resolves the above situation. At the very least, the log URL for a centralized log service can't be the same URL as the NM webapp, so we need flexibility in the executor log URL. Hope this explains the rationale well. |
vanzin
left a comment
This should have unit tests.
docs/running-on-yarn.md
Outdated
Is this the full address of the NM HTTP server? Because it feels a little conflicting with the above variable.
Also I'm not sure what "node on container" means. Perhaps "node where container was run".
My bad. It is host:port rather than a URI, and it can be retrieved from container.getNodeHttpAddress. The phrase "node on container" is borrowed from the javadoc of that method, but I'm fine with any clearer wording.
Will address.
Could this just be converted to the default value of the log configuration? Seems like all the variables here match the ones you're using there.
Yes, it will remove the branch. Will address.
These constants are only used in the methods below. Also, the methods below are only called from a single place.
Seems to me you should have a single method that implements all this logic. You could also avoid this new object, for the same reasons.
Ah OK. I'm in favor of avoiding using the string constants directly, but I don't have a strong opinion on it. Will address.
And yes, I can put them in a single method, but placing a new method in the class would put an unnecessary burden on the test code, since ExecutorRunnable needs lots of parameters to be instantiated.
If we want to add an end-to-end test (instantiating a YARN cluster and running executors) we still need to instantiate ExecutorRunnable (I think we already cover that here [1]), but if we just want to make sure the logic works properly, we might want to keep this as a new object and add a test against that object to avoid instantiating ExecutorRunnable. WDYT?
spark/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala
Lines 442 to 461 in 05cf81e
// If we are running in yarn-cluster mode, verify that driver logs links and present and are
// in the expected format.
if (conf.get("spark.submit.deployMode") == "cluster") {
  assert(listener.driverLogs.nonEmpty)
  val driverLogs = listener.driverLogs.get
  assert(driverLogs.size === 2)
  assert(driverLogs.contains("stderr"))
  assert(driverLogs.contains("stdout"))
  val urlStr = driverLogs("stderr")
  driverLogs.foreach { kv =>
    val log = Source.fromURL(kv._2).mkString
    assert(
      !log.contains(SECRET_PASSWORD),
      s"Driver logs contain sensitive info (${SECRET_PASSWORD}): \n${log} "
    )
  }
  val containerId = YarnSparkHadoopUtil.getContainerId
  val user = Utils.getCurrentUserName()
  assert(urlStr.endsWith(s"/node/containerlogs/$containerId/$user/stderr?start=-4096"))
}
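To illustrate the trade-off discussed above, here is a rough sketch (names are illustrative, not the PR's actual ones) of keeping the URL-building logic in a small standalone helper so it can be unit-tested without constructing ExecutorRunnable:

// Hypothetical helper: a pure function that is easy to unit-test in isolation.
object ExecutorLogUrlBuilder {
  def buildLogUrls(
      httpScheme: String,
      nodeHttpAddress: String,
      containerId: String,
      user: String): Map[String, String] = {
    val baseUrl = s"$httpScheme$nodeHttpAddress/node/containerlogs/$containerId/$user"
    Map(
      "stdout" -> s"$baseUrl/stdout?start=-4096",
      "stderr" -> s"$baseUrl/stderr?start=-4096")
  }
}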
This is not the Spark style for multi-line method declarations.
I guess it's allowed when it fits within two lines, but no problem changing it. Will address.
Strictly speaking, the style guide itself does not explicitly require two lines - it says it's okay if it fits within 2 lines (see databricks/scala-style-guide#64 (comment)). I intentionally avoided this because some code in some components does not comply with it.
However, strictly we'd better stick to two-space indentation whenever possible, per https://github.com/databricks/scala-style-guide#indent:
"Use 2-space indentation in general."
|
Test build #99955 has finished for PR 23260 at commit
|
|
@vanzin Thanks for the detailed review! Addressed review comments. |
squito
left a comment
I'm not sure I understand the point of this. The yarn change you mentioned, apache/hadoop@5fe1dbf / https://issues.apache.org/jira/browse/YARN-8964, only adds support for an optional clusterId query param, which seems different from this.
To make sure I understand correctly, by itself this is just allowing you set a different URL where the logs will be available -- but it doesn't do anything to actually make those logs available anywhere else. I guess you have other changes in your setup to have those logs written somewhere else in the first place? I assume for your setup, you are not using {{NodeHttpAddress}} and are replacing it with something else centralized?
4-space indent for method parameters
Ah yes, I just confused the indent rule for method parameters with the one for the return type... Nice catch. Will address.
nit: double-indent (4 spaces) the continuation line
Will address.
My understanding is that this allows pointing the Spark UI directly at the history server (old JHS or new ATS) instead of hardcoding the NM URL and relying on the NM redirecting you, since the NM may not exist later on. That does open up some questions, though. The code being modified is in the AM, which means that the user needs to opt into this when they submit the app; if there were a way to hook this up on the Spark history server side only, that might be more useful. I think someone tried that in the past but the SHS change was very YARN-specific, which made it kind of sub-optimal. |
Yes, exactly.
Yes, exactly. That's one of the issues this patch enables us to deal with; another one would be cluster awareness. The existence of
I agree the case is less about running applications and more about finished applications. Currently Spark just sets executor log URLs in the environment on the resource manager side and uses them. The usages are broad, and I'm not sure we can determine, in all of those usages, which resource manager the application runs on and whether the application is running or finished. (I'm not familiar with the UI side.) So this patch tackles it in the easiest way. |
|
Test build #99992 has finished for PR 23260 at commit
|
|
This is the other PR I was referring to: #20326 |
|
@vanzin |
|
I'm sorry, I still don't totally follow, partially from my ignorance of some details in yarn here which I'd like to better understand. Marcelo's comment sounds like a good use of this:
So this change would allow the user to do that, if they change these parameters when they submit their application? It sounds like with this patch, if you configure things correctly, you can get the JHS / ATS to display those logs (I guess this works using yarn's log aggregation?). If so, it might make sense to include a description of how you'd configure things that way in the docs (it isn't obvious to me how to do it, anyway). Or maybe I'm still not quite following, and there is some 3rd party piece here, outside of spark & yarn, which collects the logs and can serve them later on, whether or not the NM can serve the logs? |
Right, and that's the part that worries me. Both because the user has to do that (well, the admin could put the values in the default Spark properties file), and also because I'm not sure about what's the behavior while the app is running. If you go to the live UI, and click on the log link, where does that take you? |
Exactly, and collecting logs can also happen while the app is running. For now I would say there's a 3rd party here, but the Hadoop side is trying to leverage
In practice, the admin will put the value in Spark properties. I agree it doesn't sound good if end users can override it, but I'm not sure Spark can prevent it. Please let me know if there's a way for Spark to read a value only from the Spark properties file and not allow end users to override it when submitting. I'm not aware of one, and I'll use it once it exists.
Centralized log services (whichever exist) will provide the logs at unique URLs and Spark will always point to these URLs. Suppose the log service knows the status of the NM and the application; then the service can do anything we are serving now. If the NM is live, the service could redirect/forward to the NM's log URL, or just serve a stored log file which is continuously pulled from the NM. (For the latter it may present a slightly outdated log, but that just depends on when to pull, which is a detail of the log service, not something Spark should worry about.) If not, it will serve a stored log file pulled before the NM went offline. Either way, I hope we don't end up dealing with a static URL, and instead provide some flexibility to 3rd parties and end users. |
|
OK, so while this might be useful with the JHS / ATS, @HeartSaVioR 's real use case is with some external log management system, and this works both for live apps & after the app is complete, even if the NM is gone. Is this log aggregation system something publicly available, or something private? I'm a bit reluctant to add these configs with such limited applicability. |
|
It's something private for now. Btw, this feels like a chicken-and-egg problem to me. Once we open up the flexibility, 3rd parties can leverage it. If we close down the possibility, 3rd parties won't even give it a try. For example, Spark just opened up the possibility of leveraging Dropwizard metrics, and now there are many metrics sinks, either directly supported in Spark or coming from 3rd parties. |
|
Yes, I see your point about the chicken and egg. I also wonder if this feature should not be so YARN-specific then -- in fact, it almost seems more important on Kubernetes, as there is no long-lived NM there. But maybe the params you need end up being specific to the deployment mode (eg. I'm inclined to wait on this a while until we see whether there is a way to get this to work more generally, or maybe even to work with YARN while the app is running; but I don't feel so strongly that I'm blocking it, either. @vanzin do you have more thoughts? |
100% agreed, and I imagine Hadoop is going the same way as containerized deployments: being configured in the cloud and supporting elasticity, which brings these needs to the YARN side.
Agreed. Path parameters should be specific to the deployment mode, given that concepts, terms, and components can differ between deployment modes.
I think I got the same question from @vanzin and answered it. Could you elaborate if my answer doesn't address what you're considering? |
|
Well, I thought you said above that this does not integrate that well with the JHS / ATS currently, because we're not sure what will happen with running jobs. You only answered about running applications with respect to your own private log service. If this does work for live applications as well as completed applications, then I am in favor of this change. |
|
@squito This patch provides flexibility for the executor log URL, covering not only our case but all cases where the log URL can be represented via the provided patterns. If it turns out the set of patterns is not enough in the future, we can simply add the missing patterns when the YARN runner can provide them. (We can't predict and enumerate everything.) If there's a case where the YARN runner cannot provide a pattern that some service needs in order to serve logs, there's no way for Spark to determine such a URL and set it as the log URL. Regarding running vs. finished applications: even the NM provides a unique URL and redirects it when the application is finished. I don't think other services can't do the same, but if we have a concrete case where running and finished applications should have different log URLs (which would mean we may need to change the log URL only for finished applications), I can take a look and address it. Does that make sense to you? |
|
I agree with Imran that it would be best to think about how to properly support this regardless of resource manager; including allowing applications and SHS to have different URL templates. Otherwise you end up just hardcoding a different URL; if the log server URL needs to change, all your previous applications will have broken links. e.g. the application could save the parameters somehow in the event log, and the SHS could use those when replacing things in its own value for the log URL. The parameters don't need to be pre-defined, each RM can have its own set, and it's up to the admin to set things up so that the URLs in the SHS make sense for their deployment. |
|
OK. I think I see the concern now. It looks like we don't want to set a specific URL in the env which could become a permanently dead link (e.g. the webserver moves, the URL pattern changes, etc.). I feel this concern has been sitting in Spark for years (imagine the format of the YARN URL having had to change in a prior version), so I feel we have time to address it incrementally (doing better than we do now), but I also think we can investigate what it takes to achieve that and just do it at once if it doesn't require a huge effort.
I'm not sure I follow. Some parameters can be retrieved from the YARN config, but there are other parameters we get from the allocated container (and maybe other things we can't get from the YARN config). Could you please elaborate? |
Yes, and that's part of what needs to be solved when thinking about a generic solution. I'm not giving you a solution, I'm saying what I would expect from a solution. How to achieve that is a different discussion. |
Can we do Utils.tryLog(YarnConfiguration.getClusterId(conf)).toOption
Nice suggestion. Sorry I missed this while dealing with other stuff. Will address.
I took a look at Utils.tryLog and it doesn't seem to fit my intention, since it logs an error message from which end users could get the wrong impression (and it also catches exceptions too broadly). If the RM cluster id were a mandatory config we might need to take a different approach (like failing fast). If it is optional, I think we should not leave an error log.
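For reference, a minimal sketch of a quieter alternative, assuming the RM cluster id stays optional (getClusterId throws when yarn.resourcemanager.cluster-id is unset, so this just swallows that case without logging an error):

import scala.util.Try
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Returns None instead of logging an error when the cluster id is not configured.
def getClusterIdOption(conf: YarnConfiguration): Option[String] =
  Try(YarnConfiguration.getClusterId(conf)).toOption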
|
I'm OK with thinking about a general solution / different URLs between the UI in the driver and the SHS, except for the parameters. I have been thinking about how the parameters could be general / flexible, but I still don't get it.
|
|
A general solution requires two things:
The mechanism to do that doesn't need to be RM-specific. The data (i.e. the information saved in the event logs, the log URL template itself) can be, but the mechanism doesn't need to. If you think about the code you have, you already defined a set of parameters for the log URL. All you need is to do the log URL parsing on the SHS side instead of the application side, based on data written by the app. You can even implement that only for YARN, but you'd have a generic mechanism that later can also be implemented for other RMs without having to change the SHS. |
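A minimal sketch of the mechanism being described, assuming the app has written arbitrary attribute key/value pairs to the event log and the admin has configured a URL template on the SHS (names and the example pattern below are hypothetical):

// Expand {{TOKEN}} placeholders in the configured pattern using attributes
// recorded by the application; tokens with no matching attribute are left untouched.
def expandLogUrl(pattern: String, attributes: Map[String, String]): String =
  attributes.foldLeft(pattern) { case (url, (key, value)) =>
    url.replace(s"{{$key}}", value)
  }

// e.g. expandLogUrl(
//   "https://logs.example.com/{{APP_ID}}/{{CONTAINER_ID}}/{{FILE_NAME}}",
//   Map("APP_ID" -> appId, "CONTAINER_ID" -> containerId, "FILE_NAME" -> "stderr"))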
|
Test build #100716 has finished for PR 23260 at commit
|
|
I'm afraid I'm over-thinking this, but I still don't feel comfortable tackling it: it's not just moving logic to the SHS, but also defining some interfaces and taking care of running the implementation.
I think you mean For specific to YARN, I also would like to be clear about Btw, |
Force-pushed from 735c2ac to 0df31ac (compare)
|
I'm seeing flakiness in the YARN tests after rebasing, specifically after rebasing against SPARK-22404 (#19616) - more failures than successes. I'm triggering the build multiple times, as well as running the tests in my local dev environment. We should read the container logs to track down the issue, but I couldn't do that against the Jenkins build, so I have to deal with the local env. |
|
I ran the tests a few times locally without your changes and they seem fine. Also couldn't find any failures on jenkins. |
|
Looks like in unmanaged mode some attributes are not available. First one: Second one: which will leave the attribute Map empty and fail the test. Please note that it sometimes works; it doesn't always fail. Is it expected behavior for unmanaged mode to not have the NM HTTP port as well as the NM port in the system env? If so, we can change the test to not check attributes. If not, we need to investigate it. |
|
Never mind, I mis-traced the code. The executor logs don't appear to be present. I'll find the reason. |
…ntly * they're only available in container's env - cannot retrieve them outside of container process
|
Test build #101782 has finished for PR 23260 at commit
|
|
Test build #101781 has finished for PR 23260 at commit
|
|
I just fixed a bug: the reason is that we cannot retrieve NM_HOST / NM_PORT / NM_HTTP_PORT outside of the container process for a given container. In YARN cluster mode we can get them for the driver, but not for executors in either cluster or client mode. In rolling these attributes back to NM_HTTP_ADDRESS, I had to remove the explanation of pointing the custom log URL at the JHS URL, because it refers to NM_PORT (IPC), which cannot be retrieved. We might try moving the extraction of log URLs and attributes to the YARN executor side and see whether that works, but I'd prefer to address that in another issue. I would like to get this done in a stable shape first, and experiment more afterwards. |
|
retest this, please |
1 similar comment
|
retest this, please |
|
One commit trigger and two manual requests: 3 builds. Let's see how it goes. I checked that YarnClusterSuite passed in my local dev environment 3 times in a row. |
|
Test build #101796 has finished for PR 23260 at commit
|
|
Test build #101797 has finished for PR 23260 at commit
|
|
The two build failures don't look related to the YARN tests. Will retrigger the tests... |
|
Retest this, please. |
1 similar comment
|
Retest this, please. |
|
Test build #101802 has finished for PR 23260 at commit
|
|
|
def getNodeManagerHttpAddress(container: Option[Container]): String = container match {
  case Some(c) => c.getNodeHttpAddress
  case None => System.getenv(Environment.NM_HOST.name()) + ":"
So, your comments about the unmanaged AM reminded me of something else.
This code is always running in the same container (the AM, managed or not). So even in the managed case, this is not correct: you'd be returning the host:port for the NM running the AM, not the one that will be running the container.
Seems like we have to figure out how to have the executor build and return this data. That would solve the above problem, and also properly work with the unmanaged AM.
None is only used in cluster mode - to populate the driver's log URLs as well as its attributes in the container. For executors we always pass the container we got assigned, and client mode is not affected by this code. If my understanding is right, the unmanaged AM falls into that case (client mode), so we don't need to worry about it.
This takes the same path as before, just refactored. The fix is correct; in other words, before the last fix it was incorrect, and my previous comment about the observed test failure was also wrong. I might not have enough understanding of how YARN mode works. Sorry about that.
Seems like we have to figure out how to have the executor build and return this data.
I'm working on the patch - below commit passes the tests.
HeartSaVioR@3c09646
I guess we need another round of review on the new change once I apply the above commit here - if we need to go through multiple phases of review for the new commit, could we do it in a new JIRA issue/PR instead? We have already put a lot of effort (review, applying changes, new requirements) into this PR and I'd like to get it finished sooner.
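For what it's worth, the gist of having the executor build this data itself could look roughly like the following sketch (not the actual commit linked above): inside the container process, YARN exposes the NM coordinates as environment variables (names from ApplicationConstants.Environment), so the executor can read them and report them back to the driver.

import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

// Build host:port of the NM HTTP server from the executor's own container environment.
def selfNodeManagerHttpAddress(): Option[String] =
  for {
    host <- sys.env.get(Environment.NM_HOST.name())
    httpPort <- sys.env.get(Environment.NM_HTTP_PORT.name())
  } yield s"$host:$httpPort"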
Just FYI: #23706 addressed the requirement, so you may find it interesting.
|
Ok, we can deal with unmanaged client mode separately. Took another look and it looks good. Merging to master. |
|
Thanks all for the detailed reviews and for merging! I'll file a new issue and raise a PR soon for having executors retrieve this information themselves. |
…cutor log URLs in SHS
Closes apache#23260 from HeartSaVioR/SPARK-26311.
Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
What changes were proposed in this pull request?
This patch proposes adding a new configuration to the SHS: a custom executor log URL pattern. This enables end users to replace the executor log URLs with something other than what the RM provides, such as an external log service, which makes it possible to serve executor logs even when the NodeManager becomes unavailable (in the case of YARN).
End users can build their own custom executor log URLs from pre-defined patterns, which vary per resource manager. This patch adds the patterns for the YARN resource manager. (For the others, there is no executor log URL available at all, so no patterns can be defined.)
Please refer to the doc change as well as the UTs added in this patch to see how to set up the feature.
How was this patch tested?
Added UTs, as well as manual tests with a YARN cluster.
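As a usage sketch (the config key is the one this patch adds to the SHS, as I understand it; the log-service host and the exact token set below are assumptions - check the merged docs for the authoritative list), an admin could set something like the following, shown here via SparkConf for illustration; in practice it would typically go into the history server's properties file:

import org.apache.spark.SparkConf

// Hypothetical external log service URL; the tokens are filled in per executor by the SHS.
val conf = new SparkConf()
  .set("spark.history.custom.executor.log.url",
    "{{HTTP_SCHEME}}logservice.example.com/{{CLUSTER_ID}}/{{CONTAINER_ID}}/{{USER}}/{{FILE_NAME}}")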