[SPARK-9104][SPARK-9105][SPARK-9106][SPARK-9107][CORE] Netty network layer memory usage on webUI #7753
Conversation
Test build #38858 has finished for PR 7753 at commit
not maxNettyOnHeapSizeTime?
I think we'd better create a named case class like:
case class NettyMemoryMetric(value: Long, currentTime: Long)
to replace the returned tuple. I think it will be clearer to understand.
Use maxNettyReadOnheapSizeTime to make the metric more detailed: since there is also a shuffle-write Netty part (the TransportServer side), maxNettyOnHeapSizeTime should be the sum of both sides.
I think I agree with @jerryshao .
(a) The time here is just the timestamp at which we observed the max; it doesn't make sense to take a sum of them, and the case class will make that clearer. But instead of currentTime I'd use timestamp.
(b) I'm not sure there is any value in including "read" in the name. It's true that Netty is mostly using this memory for reading data off the network, but we're not actually doing anything special to make sure we only count memory used during reads. We are just counting the total memory used by Netty, and it so happens that the write side is streaming. (I think I may have been confused during earlier discussions -- the write side uses FileSegmentManagedBuffer.convertToNetty, which should be streaming.) We should mention in the docs that the primary source of this memory is the shuffle read, that Netty needs memory for the blocks fetched from each executor thread, etc. But I think we can only guarantee statistics on Netty's memory usage in general; we can't necessarily separate out the write side.
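The named case class discussed above could look like the following. This is only an illustrative sketch based on the review comments, using timestamp instead of currentTime as suggested; the helper for keeping the larger sample is hypothetical, not part of the PR.

```scala
// Hypothetical sketch of the suggested case class. `timestamp` records
// when the sample was taken, replacing the unnamed (Long, Long) tuple.
case class NettyMemoryMetric(value: Long, timestamp: Long)

object NettyMemoryMetric {
  // Keep whichever sample has the larger value, preserving when it was seen.
  def max(a: NettyMemoryMetric, b: NettyMemoryMetric): NettyMemoryMetric =
    if (a.value >= b.value) a else b
}
```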
A simple question: is it enough to only expose the maximum memory usage of the Netty layer? Besides, IMHO it would be better to separate getting the Netty memory usage and displaying it on the web portal into two different PRs.
@jerryshao, thanks for your review and feedback; I'll separate it into different PRs. I will keep this PR open to wait for others, to see whether they have different thoughts on the whole implementation.
@squito mind helping to review this?
I don't think this is the "right" port -- e.g., it's not the port that users will see elsewhere in the UI for the executor. I think that will be the port in env.address.port (which is what is used in the onExecutorAdded event).
Jenkins, retest this please
nit: ExecutorMetrics is imported twice
Since we'll eventually want to add more metrics, can you put all the Netty metrics into another case class inside? Also, I'm wondering if we want to use "netty" in the name -- I think most users won't know or care about Netty in particular. Should it just be named "network" or "transport"? The nio implementation should indicate that metrics are missing. I guess altogether this means doing something like:

    class ExecutorMetrics {
      var transportMetrics: Option[TransportMetrics] = ...
    }
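Fleshing out the suggestion above, a runnable sketch might look as follows. The field names of TransportMetrics are illustrative assumptions, not the PR's actual definition; the point is the transport-neutral naming and the Option so the nio implementation can report None.

```scala
// Hypothetical shape of the transport-agnostic metrics container.
// Field names (onHeapSize, offHeapSize, timestamp) are assumptions.
case class TransportMetrics(
    onHeapSize: Long,
    offHeapSize: Long,
    timestamp: Long)

class ExecutorMetrics {
  // None means the transport layer (e.g. nio) does not report metrics.
  var transportMetrics: Option[TransportMetrics] = None
}
```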
@jerryshao I'm not entirely sure I know what you mean by:
can you elaborate? Obviously we'd always like more metrics, but are you saying this isn't useful?
Test build #164 has finished for PR 7753 at commit
Test build #39044 has finished for PR 7753 at commit
Hi @squito, IIUC the code here always keeps the max memory usage: if the current memory usage is smaller than the previous one, it will simply be discarded. From my understanding it would be better to transfer the raw metrics to the driver and have each listener process them further as it wants; that would be more flexible. If people want the minimum memory usage, for example, they could compute it with their own logic in a listener. Only keeping the max memory usage here restricts flexibility for different users.
@jerryshao I don't think that's true. The executor metrics that get sent back to the driver have both the max value and the current value, so a custom listener could already track the minimum memory usage if it wanted. I wonder if we should even calculate the max at all inside the executor, or just leave it to the listeners. On one hand, max is what you will usually want, so we might as well make it more convenient. OTOH, maybe it would be cleaner to leave it to the listeners -- e.g., I can already imagine that in the future we may want to track max memory usage during each stage, and I definitely think that logic should be left to the listeners. What do you think @liyezhang556520? I will think about it more, but right now I'm leaning towards just leaving the max to the listeners.
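To illustrate the "leave it to the listeners" idea above, here is a minimal sketch of listener-side max tracking over raw metric updates. The class and method names are simplified stand-ins, not the actual Spark listener API.

```scala
import scala.collection.mutable

// Hypothetical listener-side aggregator: the executor sends only raw
// (current) values, and the listener derives the max per executor itself.
class MaxMemoryTracker {
  private val maxByExecutor = mutable.Map.empty[String, Long]

  // Called on every raw metrics update received from an executor.
  def onMetricsUpdate(executorId: String, currentOnHeap: Long): Unit = {
    val prev = maxByExecutor.getOrElse(executorId, 0L)
    if (currentOnHeap > prev) maxByExecutor(executorId) = currentOnHeap
  }

  def maxFor(executorId: String): Option[Long] = maxByExecutor.get(executorId)
}
```

The same pattern extends naturally to per-stage maxima by keying the map on (stageId, executorId), which is why computing aggregates in listeners rather than in the executor keeps the design flexible.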
slightly higher-level question -- is anything gained by breaking this into the different test cases? I feel like you could just take this final test case, add some more stages at the end with no metric updates, and also add some asserts on listener.activeStagesToMem etc., and that would cover all the other test cases.
I'm all for having more complete coverage with more test cases, but not sure we gain anything here. what do you think?
I just think we should separate the cases so that we can make sure all situations are covered. Actually yes, some of the current cases have duplicated parts; I'll try to remove some.
The comment is wrong (off by one). Maybe just make the comment "one extra line for SparkListenerLogStart".
Yes, I forgot to update the comment, thanks.
Test build #47305 has finished for PR 7753 at commit
Test build #47415 has finished for PR 7753 at commit
Rather than putting this method here, just define it inside the test case where you use it (though I think you probably don't even need it there).
One thought here: it'd probably be nice to have a JSON history with the new events as part of the history server suite regression tests; that'll catch any changes to the new events which would break backwards compatibility.
Test build #49235 has finished for PR 7753 at commit
Hi @steveloughran, thank you for your attention and your comments; further comments on this PR are much appreciated. Your advice is quite correct: we need a regression test for the history server suite. I think this is related to what @squito mentioned about adding JSON API support. I'm wondering whether we can do it in another PR after this PR is ready to merge, or once this feature is accepted by the community. And of course, REST API support is supposed to be one part of this feature.
Test build #49239 has finished for PR 7753 at commit
@liyezhang556520 can you remove the "WIP" from the title? cc @andrewor14
Thanks for the pull request. I'm going through a list of pull requests to cut them down, since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one; we can also continue the discussion on the JIRA ticket. Note: I actually think this is a pretty useful feature, and in the future maybe we should switch over to consolidating memory management entirely with Netty.
This is a sub-task of SPARK-9103. We'd like to expose memory usage at Spark run time; this is the first step, exposing the Netty buffers used in both on-heap and off-heap memory. The metrics are also shown on the WebUI: in this PR, a new web tab named Memory is added, which shows the memory usage of each executor (it can show more details in the future). The screenshots are like the following.
This is each stage's memory info:

This is the History view:

This is WIP because unit tests are still to be done.