[SPARK-26311][CORE] New feature: apply custom log URL pattern for executor log URLs in SHS #23260
Conversation
|
Test build #99857 has finished for PR 23260 at commit
|
|
If you're on YARN, this feels like something you would manage via YARN and its cluster management options. Is there a specific use case here, that this has to happen in Spark? |
|
@srowen This situation can happen when decommissioning happens fairly frequently, or when end users want some elasticity for the YARN cluster (not only decommissioning nodes, but also elasticity of the YARN cluster itself - YARN has a cluster id for the RM which identifies the cluster, and that can be leveraged when dealing with multiple YARN clusters). There's also a similar change applied on the Hadoop side. We are experimenting with a central log service which resolves the above situation. At the very least, the log URL for a centralized log service can't be the same URL as the NM webapp, so we need flexibility in the executor log URL. Hope this explains the rationale well. |
vanzin
left a comment
This should have unit tests.
docs/running-on-yarn.md
Outdated
Is this the full address of the NM HTTP server? Because it feels a little conflicting with the above variable.
Also I'm not sure what "node on container" means. Perhaps "node where container was run".
My bad. It is host:port rather than a URI, and it can be retrieved from container.getNodeHttpAddress. The phrase "node on container" is borrowed from the javadoc of that method, but I'm fine with any clearer wording.
Will address.
Could this just be converted to the default value of the log configuration? Seems like all the variables here match the ones you're using there.
Yes, it will remove the branch. Will address.
These constants are only used in the methods below. Also, the methods below are only called from a single place.
Seems to me you should have a single method that implements all this logic. You could also avoid this new object, for the same reasons.
Ah OK. I'm in favor of avoiding using the string constants directly, but I don't have a strong opinion on it. Will address.
And yes, I can put them in a single method, but placing a new method in the class would put an unnecessary burden on the test code, since ExecutorRunnable needs lots of parameters to be instantiated.
If we want to add an end-to-end test (instantiating a YARN cluster and running executors) we still need to instantiate ExecutorRunnable (I think we already cover that here [1]), but if we just want to make sure the logic works properly, we might want to keep this as a new object and add a test against that object to avoid instantiating ExecutorRunnable. WDYT?
spark/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala
Lines 442 to 461 in 05cf81e
// If we are running in yarn-cluster mode, verify that driver logs links and present and are
// in the expected format.
if (conf.get("spark.submit.deployMode") == "cluster") {
  assert(listener.driverLogs.nonEmpty)
  val driverLogs = listener.driverLogs.get
  assert(driverLogs.size === 2)
  assert(driverLogs.contains("stderr"))
  assert(driverLogs.contains("stdout"))
  val urlStr = driverLogs("stderr")
  driverLogs.foreach { kv =>
    val log = Source.fromURL(kv._2).mkString
    assert(
      !log.contains(SECRET_PASSWORD),
      s"Driver logs contain sensitive info (${SECRET_PASSWORD}): \n${log} "
    )
  }
  val containerId = YarnSparkHadoopUtil.getContainerId
  val user = Utils.getCurrentUserName()
  assert(urlStr.endsWith(s"/node/containerlogs/$containerId/$user/stderr?start=-4096"))
}
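To illustrate the trade-off discussed above, here is a rough sketch (names are illustrative, not the PR's actual ones) of keeping the URL-building logic in a small standalone helper so it can be unit-tested without constructing ExecutorRunnable:

// Hypothetical helper: a pure function that is easy to unit-test in isolation.
object ExecutorLogUrlBuilder {
  def buildLogUrls(
      httpScheme: String,
      nodeHttpAddress: String,
      containerId: String,
      user: String): Map[String, String] = {
    val baseUrl = s"$httpScheme$nodeHttpAddress/node/containerlogs/$containerId/$user"
    Map(
      "stdout" -> s"$baseUrl/stdout?start=-4096",
      "stderr" -> s"$baseUrl/stderr?start=-4096")
  }
}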
This is not the Spark style for multi-line method declarations.
I guess it's allowed when it fits within two lines, but no problem changing it. Will address.
Strictly speaking, the style guide itself does not explicitly require two lines - it says it's okay if it fits within 2 lines (see databricks/scala-style-guide#64 (comment)). I intentionally avoided this because some code in some components does not comply with it.
However, strictly we'd better stick to two-space indentation whenever possible, per https://github.com/databricks/scala-style-guide#indent:
"Use 2-space indentation in general."
|
Test build #99955 has finished for PR 23260 at commit
|
|
@vanzin Thanks for the detailed review! Addressed review comments. |
squito
left a comment
I'm not sure I understand the point of this. The yarn change you mentioned, apache/hadoop@5fe1dbf / https://issues.apache.org/jira/browse/YARN-8964, only adds support for an optional clusterId query param, which seems different from this.
To make sure I understand correctly, by itself this is just allowing you set a different URL where the logs will be available -- but it doesn't do anything to actually make those logs available anywhere else. I guess you have other changes in your setup to have those logs written somewhere else in the first place? I assume for your setup, you are not using {{NodeHttpAddress}} and are replacing it with something else centralized?
4-space indent for method parameters
Ah yes, I just confused the indent rule for method parameters with the one for the return type... Nice catch. Will address.
nit: double-indent (4 spaces) the continuation line
Will address.
My understanding is that this allows pointing the Spark UI directly at the history server (old JHS or new ATS) instead of hardcoding the NM URL and relying on the NM redirecting you, since the NM may not exist later on. That does open up some questions, though. The code being modified is in the AM, which means that the user needs to opt into this when they submit the app; if there were a way to hook this up on the Spark history server side only, that might be more useful. I think someone tried that in the past but the SHS change was very YARN-specific, which made it kind of sub-optimal. |
Yes, exactly.
Yes, exactly. That's one of the issues this patch enables us to deal with; another one would be cluster awareness. The existence of
I agree the case is less about running applications and more about finished applications. Currently Spark just sets executor log URLs in the environment on the resource manager side and uses them. The usages are broad, and I'm not sure we can determine, in all of those usages, which resource manager the application runs on and whether the application is running or finished. (I'm not familiar with the UI side.) So this patch tackles it in the easiest way. |
|
Test build #99992 has finished for PR 23260 at commit
|
|
This is the other PR I was referring to: #20326 |
|
@vanzin |
|
I'm sorry, I still don't totally follow, partially from my ignorance of some details in yarn here which I'd like to better understand. Marcelo's comment sounds like a good use of this:
So this change would allow the user to do that, if they change these parameters when they submit their application? It sounds like with this patch, if you configure things correctly, you can get the JHS / ATS to display those logs (I guess this works using yarn's log aggregation?). If so, it might make sense to include a description of how you'd configure things that way in the docs (it isn't obvious to me how to do it, anyway). Or maybe I'm still not quite following, and there is some 3rd party piece here, outside of spark & yarn, which collects the logs and can serve them later on, whether or not the NM can serve the logs? |
Right, and that's the part that worries me. Both because the user has to do that (well, the admin could put the values in the default Spark properties file), and also because I'm not sure about what's the behavior while the app is running. If you go to the live UI, and click on the log link, where does that take you? |
Exactly, and collecting logs can also happen while the app is running. For now I would say there's a 3rd party here, but the Hadoop side is trying to leverage
In practice, the admin will put the value in Spark properties. I agree it doesn't sound good if end users can override it, but I'm not sure Spark can prevent it. Please let me know if there's a way for Spark to read a value only from the Spark properties file and not allow end users to override it when submitting. I'm not aware of one, and I'll use it once it exists.
Centralized log services (whichever exist) will provide the logs at unique URLs and Spark will always point to these URLs. Suppose the log service knows the status of the NM and the application; then the service can do anything we are serving now. If the NM is live, the service could redirect/forward to the NM's log URL, or just serve a stored log file which is continuously pulled from the NM. (For the latter it may present a slightly outdated log, but that just depends on when to pull, which is a detail of the log service, not something Spark should worry about.) If not, it will serve a stored log file pulled before the NM went offline. Either way, I hope we don't end up dealing with a static URL, and instead provide some flexibility to 3rd parties and end users. |
|
OK, so while this might be useful with the JHS / ATS, @HeartSaVioR 's real use case is with some external log management system, and this works both for live apps & after the app is complete, even if the NM is gone. Is this log aggregation system something publicly available, or something private? I'm a bit reluctant to add these configs with such limited applicability. |
|
It's something private for now. Btw, this feels like a chicken-and-egg problem to me. Once we open up the flexibility, 3rd parties can leverage it. If we close down the possibility, 3rd parties won't even give it a try. For example, Spark just opened up the possibility of leveraging Dropwizard metrics, and now there are many metrics sinks, either directly supported in Spark or coming from 3rd parties. |
|
Yes, I see your point about the chicken and egg. I also wonder if this feature should not be so YARN-specific then -- in fact, it almost seems more important on Kubernetes, as there is no long-lived NM there. But maybe the params you need end up being specific to the deployment mode (eg. I'm inclined to wait on this a while until we see whether there is a way to get this to work more generally, or maybe even to work with YARN while the app is running; but I don't feel so strongly that I'm blocking it, either. @vanzin do you have more thoughts? |
100% agreed, and I imagine Hadoop is going the same way as containerized deployments: being configured in the cloud and supporting elasticity, which brings these needs to the YARN side.
Agreed. Path parameters should be specific to the deployment mode, given that concepts, terms, and components can differ between deployment modes.
I think I got the same question from @vanzin and answered it. Could you elaborate if my answer doesn't address what you're considering? |
|
Well, I thought you said above that this does not integrate that well with the JHS / ATS currently, because we're not sure what will happen with running jobs. You only answered about running applications with respect to your own private log service. If this does work for live applications as well as completed applications, then I am in favor of this change. |
|
@squito This patch provides flexibility for the executor log URL, covering not only our case but all cases where the log URL can be represented via the provided patterns. If it turns out the set of patterns is not enough in the future, we can simply add the missing patterns when the YARN runner can provide them. (We can't predict and enumerate everything.) If there's a case where the YARN runner cannot provide a pattern that some service needs in order to serve logs, there's no way for Spark to determine such a URL and set it as the log URL. Regarding running vs. finished applications: even the NM provides a unique URL and redirects it when the application is finished. I don't think other services can't do the same, but if we have a concrete case where running and finished applications should have different log URLs (which would mean we may need to change the log URL only for finished applications), I can take a look and address it. Does that make sense to you? |
|
I agree with Imran that it would be best to think about how to properly support this regardless of resource manager; including allowing applications and SHS to have different URL templates. Otherwise you end up just hardcoding a different URL; if the log server URL needs to change, all your previous applications will have broken links. e.g. the application could save the parameters somehow in the event log, and the SHS could use those when replacing things in its own value for the log URL. The parameters don't need to be pre-defined, each RM can have its own set, and it's up to the admin to set things up so that the URLs in the SHS make sense for their deployment. |
|
OK. I think I see the concern now. It looks like we don't want to set a specific URL in the env which could become a permanently dead link (e.g. the webserver moves, the URL pattern changes, etc.). I feel this concern has been sitting in Spark for years (imagine the format of the YARN URL having had to change in a prior version), so I feel we have time to address it incrementally (doing better than we do now), but I also think we can investigate what it takes to achieve that and just do it at once if it doesn't require a huge effort.
I'm not sure I follow. Some parameters can be retrieved from the YARN config, but there are other parameters we get from the allocated container (and maybe other things we can't get from the YARN config). Could you please elaborate? |
Yes, and that's part of what needs to be solved when thinking about a generic solution. I'm not giving you a solution, I'm saying what I would expect from a solution. How to achieve that is a different discussion. |
Can we do Utils.tryLog(YarnConfiguration.getClusterId(conf)).toOption
Nice suggestion. Sorry I missed this while dealing with other stuff. Will address.
I took a look at Utils.tryLog and it doesn't seem to fit my intention, since it logs an error message from which end users could get the wrong impression (and it also catches exceptions too broadly). If the RM cluster id were a mandatory config we might need to take a different approach (like failing fast). If it is optional, I think we should not leave an error log.
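For reference, a minimal sketch of a quieter alternative, assuming the RM cluster id stays optional (getClusterId throws when yarn.resourcemanager.cluster-id is unset, so this just swallows that case without logging an error):

import scala.util.Try
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Returns None instead of logging an error when the cluster id is not configured.
def getClusterIdOption(conf: YarnConfiguration): Option[String] =
  Try(YarnConfiguration.getClusterId(conf)).toOption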
|
I'm OK with thinking about a general solution / different URLs between the UI in the driver and the SHS, except for the parameters. I have been thinking about how the parameters could be general / flexible, but I still don't get it.
|
|
A general solution requires two things:
The mechanism to do that doesn't need to be RM-specific. The data (i.e. the information saved in the event logs, the log URL template itself) can be, but the mechanism doesn't need to. If you think about the code you have, you already defined a set of parameters for the log URL. All you need is to do the log URL parsing on the SHS side instead of the application side, based on data written by the app. You can even implement that only for YARN, but you'd have a generic mechanism that later can also be implemented for other RMs without having to change the SHS. |
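A minimal sketch of the mechanism being described, assuming the app has written arbitrary attribute key/value pairs to the event log and the admin has configured a URL template on the SHS (names and the example pattern below are hypothetical):

// Expand {{TOKEN}} placeholders in the configured pattern using attributes
// recorded by the application; tokens with no matching attribute are left untouched.
def expandLogUrl(pattern: String, attributes: Map[String, String]): String =
  attributes.foldLeft(pattern) { case (url, (key, value)) =>
    url.replace(s"{{$key}}", value)
  }

// e.g. expandLogUrl(
//   "https://logs.example.com/{{APP_ID}}/{{CONTAINER_ID}}/{{FILE_NAME}}",
//   Map("APP_ID" -> appId, "CONTAINER_ID" -> containerId, "FILE_NAME" -> "stderr"))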
|
Test build #100716 has finished for PR 23260 at commit
|
|
I'm afraid I'm over-thinking this, but I still don't feel comfortable tackling it: it's not just moving logic to the SHS, but also defining some interfaces and taking care of running the implementation.
I think you mean For specific to YARN, I also would like to be clear about Btw, |
Force-pushed from 735c2ac to 0df31ac (compare)
|
I'm seeing flakiness in the YARN tests after rebasing, specifically after rebasing against SPARK-22404 (#19616) - more failures than successes. I'm triggering the build multiple times, as well as running the tests in my local dev environment. We should read the container logs to track down the issue, but I couldn't do that against the Jenkins build, so I have to deal with the local env. |
|
I ran the tests a few times locally without your changes and they seem fine. Also couldn't find any failures on jenkins. |
|
Looks like in unmanaged mode some attributes are not available. First one: Second one: which will leave the attribute Map empty and fail the test. Please note that it sometimes works; it doesn't always fail. Is it expected behavior for unmanaged mode to not have the NM HTTP port as well as the NM port in the system env? If so, we can change the test to not check attributes. If not, we need to investigate it. |
|
Never mind, I mis-traced the code. The executor logs don't appear to be present. I'll find the reason. |
…ntly * they're only available in container's env - cannot retrieve them outside of container process
|
Test build #101782 has finished for PR 23260 at commit
|
|
Test build #101781 has finished for PR 23260 at commit
|
|
I just fixed a bug: the reason is that we cannot retrieve NM_HOST / NM_PORT / NM_HTTP_PORT outside of the container process for a given container. In YARN cluster mode we can get them for the driver, but not for executors in either cluster or client mode. In rolling these attributes back to NM_HTTP_ADDRESS, I had to remove the explanation of pointing the custom log URL at the JHS URL, because it refers to NM_PORT (IPC), which cannot be retrieved. We might try moving the extraction of log URLs and attributes to the YARN executor side and see whether that works, but I'd prefer to address that in another issue. I would like to get this done in a stable shape first, and experiment more afterwards. |
|
retest this, please |
1 similar comment
|
retest this, please |
|
One commit trigger and two manual requests: 3 builds. Let's see how it goes. I checked that YarnClusterSuite passed in my local dev environment 3 times in a row. |
|
Test build #101796 has finished for PR 23260 at commit
|
|
Test build #101797 has finished for PR 23260 at commit
|
|
The two build failures don't look related to the YARN tests. Will retrigger the tests... |
|
Retest this, please. |
1 similar comment
|
Retest this, please. |
|
Test build #101802 has finished for PR 23260 at commit
|
|
|
def getNodeManagerHttpAddress(container: Option[Container]): String = container match {
  case Some(c) => c.getNodeHttpAddress
  case None => System.getenv(Environment.NM_HOST.name()) + ":"
So, your comments about the unmanaged AM reminded me of something else.
This code is always running in the same container (the AM, managed or not). So even in the managed case, this is not correct: you'd be returning the host:port for the NM running the AM, not the one that will be running the container.
Seems like we have to figure out how to have the executor build and return this data. That would solve the above problem, and also properly work with the unmanaged AM.
None is only used in cluster mode - to populate the driver's log URLs as well as its attributes in the container. For executors we always pass the container we got assigned, and client mode is not affected by this code. If my understanding is right, the unmanaged AM falls into that case (client mode), so we don't need to worry about it.
This takes the same path as before, just refactored. The fix is correct; in other words, before the last fix it was incorrect, and my previous comment about the observed test failure was also wrong. I might not have enough understanding of how YARN mode works. Sorry about that.
Seems like we have to figure out how to have the executor build and return this data.
I'm working on the patch - below commit passes the tests.
HeartSaVioR@3c09646
I guess we need another round of review on the new change once I apply the above commit here - if we need to go through multiple phases of review for the new commit, could we do it in a new JIRA issue/PR instead? We have already put a lot of effort (review, applying changes, new requirements) into this PR and I'd like to get it finished sooner.
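For what it's worth, the gist of having the executor build this data itself could look roughly like the following sketch (not the actual commit linked above): inside the container process, YARN exposes the NM coordinates as environment variables (names from ApplicationConstants.Environment), so the executor can read them and report them back to the driver.

import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

// Build host:port of the NM HTTP server from the executor's own container environment.
def selfNodeManagerHttpAddress(): Option[String] =
  for {
    host <- sys.env.get(Environment.NM_HOST.name())
    httpPort <- sys.env.get(Environment.NM_HTTP_PORT.name())
  } yield s"$host:$httpPort"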
Just FYI: #23706 addressed the requirement, so you may find it interesting.
|
Ok, we can deal with unmanaged client mode separately. Took another look and it looks good. Merging to master. |
|
Thanks all for the detailed reviews and for merging! I'll file a new issue and raise a PR soon for having executors retrieve this information themselves. |
…cutor log URLs in SHS
Closes apache#23260 from HeartSaVioR/SPARK-26311.
Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
What changes were proposed in this pull request?
This patch proposes adding a new configuration to the SHS: a custom executor log URL pattern. This enables end users to replace the executor log URLs with something other than what the RM provides, such as an external log service, which makes it possible to serve executor logs even when the NodeManager becomes unavailable (in the case of YARN).
End users can build their own custom executor log URLs from pre-defined patterns, which vary per resource manager. This patch adds the patterns for the YARN resource manager. (For the others, there is no executor log URL available at all, so no patterns can be defined.)
Please refer to the doc change as well as the UTs added in this patch to see how to set up the feature.
How was this patch tested?
Added UTs, as well as manual tests with a YARN cluster.
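As a usage sketch (the config key is the one this patch adds to the SHS, as I understand it; the log-service host and the exact token set below are assumptions - check the merged docs for the authoritative list), an admin could set something like the following, shown here via SparkConf for illustration; in practice it would typically go into the history server's properties file:

import org.apache.spark.SparkConf

// Hypothetical external log service URL; the tokens are filled in per executor by the SHS.
val conf = new SparkConf()
  .set("spark.history.custom.executor.log.url",
    "{{HTTP_SCHEME}}logservice.example.com/{{CLUSTER_ID}}/{{CONTAINER_ID}}/{{USER}}/{{FILE_NAME}}")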